Document [original]

Distributed Function Approximation over Noisy Channels

With Reliability and Security Guarantees for Applications in Wireless

Networks

vorgelegt von

M. Sc.

Matthias Frey

https://orcid.org/0000-0003-3016-2644

an der Fakult¨

at IV - Elektrotechnik und Informatik

der Technischen Universit¨

at Berlin

zur Erlangung des akademischen Grades

Doktor der Naturwissenschaften

- Dr. rer. nat. -

genehmigte Dissertation

Promotionsausschuss:

Vorsitzender: Prof. Giuseppe Caire, Ph.D.

Gutachter: Prof. Dr.-Ing. Slawomir Sta´nczak

Gutachter: Prof. Dr. Michael C. Gastpar

Gutachter: Dr. Jingge Zhu

Tag der wissenschaftlichen Aussprache: 12. Dezember 2022

Berlin 2023

Abstract

Over-the-Air (OTA) computation is the problem of computing functions of distributed

data over a wireless channel without transmitting the entirety of the data to a central

point. By avoiding such costly transmissions, OTA computation schemes can achieve a

better-than-linear (depending on the function, often logarithmic or even constant) scaling

of the communication cost as the number of transmitters grows. Among the most common

functions computed OTA are linear functions such as weighted sums. In this work, we

propose and analyze a method for the approximation of functions of distributed arguments

over noisy channels. This method can be used as an analog OTA computation scheme

for a class of functions that contains linear functions as well as some nonlinear functions

such as p-norms of vectors. We prove error bound guarantees that are valid for fast-fading

channels and all distributions of fading and noise contained in the class of sub-Gaussian

distributions. This class includes Gaussian distributions, but also many other practically

relevant cases such as Class A Middleton noise and fading with dominant line-of-sight

components. In addition, there can be correlations in the fading and noise so that the

presented results also apply to, for example, block fading channels and channels with

bursty interference. We do not rely on any stochastic characterization of the distributed

arguments of the OTA computed function; in particular, there is no assumption that these

arguments are drawn from identical or independent probability distributions. Our analysis

relies on tools from high-dimensional statistics which we adapt so that they are applicable

to the scenario at hand. The resulting error guarantees are nonasymptotic and therefore

provide error bounds that are valid for a finite number of channel uses.

OTA computation has a huge potential for reducing communication cost in applications

such as Machine Learning (ML)-based distributed anomaly detection in large wireless

sensor networks. We show this potential with two examples of how our OTA computation

scheme can be used to vastly increase the efficiency of Vertical Federated Learning (VFL)

over a wireless channel. We also illustrate the efficiency gain with numerical simulations

for a few example cases.

Then, we move on to propose a new method to protect OTA computation schemes

against passive eavesdropping. Our method uses a friendly jammer whose signal is –

contrary to common intuition – stronger at the legitimate receiver than it is at the eaves-

iii

dropper. It works for a large class of analog OTA computation schemes and we give details

on the special case of computing an arithmetic average over an Additive White Gaussian

Noise (AWGN) channel. The derived secrecy guarantee translates to a lower bound on

the eavesdropper’s mean square error while the question of how to provide operationally

more significant guarantees such as semantic security remains open for future work. As

key ingredients in proving the security guarantees, we propose and prove generalizations

of known results on channel resolvability and coding for compound channels for the case

of continuous channel input and output alphabets.

Zusammenfassung

Funktionsberechnung im Funkkanal ist eine Methode, Funktionen verteilter Daten ohne

eine vollst¨

andige ¨

Ubertragung der Daten zu einem zentralen Punkt zu berechnen. Indem

solche ¨

Ubertragungen vermieden werden, kann erreicht werden, dass der Ressourcenver-

brauch weniger schnell als linear mit der Anzahl der Sender w¨

achst. Abh¨

angig von der zu

berechnenden Funktion kann dieses Wachstum dann in vielen F¨

allen logarithmisch oder

sogar konstant sein (d.h. die zur Funktionsberechnung n¨

otigen Kanalressourcen wachsen

uberhaupt nicht, wenn die Anzahl der Sender w¨

achst). Zu den am h¨

aufigsten im Funk-

kanal berechneten Funktionen geh¨

oren lineare Funktionen wie z.B. gewichtete Summen.

In der vorliegenden Arbeit f¨

uhren wir eine Methode zur verteilten Approximation von

Funktionen in verrauschten Kan¨

alen ein. Diese Methode kann als Verfahren zur analogen

Funktionsberechnung im Funkkanal genutzt werden. Sie ist auf eine Klasse von Funktionen

anwendbar, die zum einen alle linearen Funktionen, zum anderen aber auch einige nicht-

lineare Funktionen wie die p-Norm von Vektoren enth¨

alt. Wir zeigen Fehlerschranken f¨

unser Verfahren, die in Kan¨

alen mit Fast Fading gelten, wobei sowohl die Verteilung des

Fadings als auch die des Rauschens zur Klasse der sub-Gauß’schen Wahrscheinlichkeitsver-

teilungen geh¨

oren m¨

ussen. Diese Klasse enth¨

alt nicht nur alle Normalverteilungen, sondern

auch viele andere in der Praxis relevanten Verteilungen wie z.B. Class-A-Rauschen nach

Middleton und Fadingverteilungen mit einer dominanten Sichtkomponente. Zudem sind

unsere Fehlerschranken auch dann noch g¨

ultig, wenn es Korrelationen in der Verteilung des

Fadings und/oder Rauschens gibt, sodass unsere Ergebnisse beispielsweise auch auf Kan¨

ale

mit Blockfading oder geb¨

undelt auftretender Interferenz anwendbar sind. Dabei verwen-

den wir keinerlei stochastische Charakterisierung der Argumente der ¨

uber den Funkkanal

zu berechnenden Funktion. Dies bedeutet insbesondere, dass keine Annahme ¨

uber identi-

sche Verteilung oder Unabh¨

angigkeit dieser Argumente n¨

otig ist. Unsere Analyse baut auf

Werkzeugen aus der hochdimensionalen Statistik auf, die wir derart anpassen, dass sie im

vorliegenden Szenario anwendbar sind. Die sich daraus ergebenden Fehlerschranken sind

nichtasymptotisch, d.h. sie sind f¨

ur jede beliebige (endliche) Anzahl von Kanalnutzungen

g¨

ultig.

Funktionsberechnung im Funkkanal bietet ein riesiges Potenzial, in Anwendungen wie

z.B. der auf maschinellem Lernen basierenden Anomaliedetektion in großen Sensornetzen

die zur Kommunikation n¨

otigen Ressourcen drastisch zu reduzieren. Wir veranschaulichen

dieses Potenzial, indem wir zwei Beispiele ausf¨

uhren, die zeigen, wie unsere Methoden zur

Funktionsberechnung im Funkkanal die Effizienz von vertikalem f¨

oderierten Lernen enorm

steigern k¨

onnen. Wir illustrieren diesen Effizienzgewinn außerdem f¨

ur einige ausgesuchte

Spezialf¨

alle anhand numerischer Simulationen.

Anschließend f¨

uhren wir ein neues Verfahren ein, um Funktionsberechnungen im Funk-

kanal gegen passive Lauscher zu sch¨

utzen. Unsere Methode basiert auf der Kooperation der

legitimen Kommunikationspartner mit einem Jammer, dessen Signal – entgegen den nor-

malerweise in diesem Zusammenhang gemachten Annahmen – beim legitimen Empf¨

anger

in einer gr¨

oßeren Signalst¨

arke ankommt als beim Lauscher. Sie funktioniert f¨

ur eine große

Klasse von Verfahren zur Funktionsberechnung ¨

uber den Funkkanal, und wir f¨

uhren den

Spezialfall der Berechnung eines arithmetischen Mittelwerts ¨

uber einen Kanal mit addi-

tivem weißen normalverteilten Rauschen detailliert aus. Dabei erhalten wir eine Sicher-

heitsgarantie, die den durchschnittlichen quadratischen Fehler des Lauschers nach unten

begrenzt. Die Frage, wie st¨

arkere Sicherheitsgarantien (z.B. semantische Sicherheit) gege-

ben werden k¨

onnen, bleibt dabei f¨

ur zuk¨

unftige Forschungsarbeiten offen. Die wichtigsten

informationstheoretischen Zutaten f¨

ur den Beweis der Sicherheitsgarantie sind Verallge-

meinerungen bekannter Ergebnisse zur Resolvability von Kan¨

alen und zur Kommunikation

uber Compound-Kan¨

ale f¨

ur den Fall kontinulierlicher Ein- und Ausgabealphabete, die wir

im Rahmen dieser Arbeit ebenfalls beweisen.

Acknowledgments

I would like to thank Prof. Dr.-Ing. Slawomir Sta´nczak for the opportunity to pursue a

Ph.D. degree as a member of his research group at Technische Universit¨at Berlin, and for

the support and scientific freedom I had during that time. I am also particularly indebted

to Dr. Igor Bjelakovi´c who has taken a lot of time and effort to guide and support me in

my research work for this thesis, and beyond.

Moreover, I would like to thank Prof. Dr. Michael C. Gastpar and Lecturer Dr. Jingge

Zhu for serving as referees for this dissertation.

Finally, I gratefully recognize my co-authors Navneet Agrawal, Dr.-Ing. Zoran Utkovski,

Patrick Agostini, Miguel A. Gutierrez-Estevez and Daniel Sch¨aufele for their collabora-

tion, as well as all my current and former colleagues at Technische Universit¨at Berlin

and Fraunhofer Heinrich Hertz Institute. It was a pleasure to share many stimulating

discussions with them and I am appreciative of this joyful work and research environment.

vii

The work contained in this thesis has been published in various venues. Chapter 1 and the

introductory paragraphs of Chapter 2 and Chapter 4 are based on the publication [7] which

is reproduced with permission from Springer Nature. The remaining part of Chapter 2

IEEE. The remaining part of Chapter 4 and part of Section 1.5 are based on [1,5] which

and may be ©IEEE in the future.

In reference to IEEE copyrighted material which is used with permission in this the-

sis, the IEEE does not endorse any of Technische Universit¨at Berlin’s products or ser-

vices. Internal or personal use of this material is permitted. If interested in reprint-

ing/republishing IEEE copyrighted material for advertising or promotional purposes or for

creating new collective works for resale or redistribution, please go to http://www.ieee.

org/publications_standards/publications/rights/rights_link.html to learn how

to obtain a License from RightsLink. If applicable, University Microfilms and/or ProQuest

Library, or the Archives of Canada may supply single copies of the dissertation.

Contents

1 Introduction 1

1.1 NomographicFunctions ............................. 1

1.2 Distributed Function Approximation in Noisy Channels: An Example . . . 4

1.3 Over-the-Air Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.1 Digital Over-the-Air Computation . . . . . . . . . . . . . . . . . . . 6

1.3.2 Analog Over-the-Air Computation . . . . . . . . . . . . . . . . . . . 8

1.4 Applications of Over-the-Air Computation . . . . . . . . . . . . . . . . . . . 10

1.4.1 Distributed Machine Learning . . . . . . . . . . . . . . . . . . . . . . 10

1.4.2 Consensus over Wireless Channels . . . . . . . . . . . . . . . . . . . 13

1.5 Contribution and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Distributed Function Approximation in Fading and Correlated Channels 19

2.1 SystemModel................................... 20

2.1.1 Sub-Gaussian Random Variables . . . . . . . . . . . . . . . . . . . . 20

2.1.2 Channel and Correlation Model . . . . . . . . . . . . . . . . . . . . . 21

2.1.3 Discussion of the System Model Assumptions . . . . . . . . . . . . . 23

2.2 ProblemStatement................................ 24

2.2.1 Distributed Approximation of Functions . . . . . . . . . . . . . . . . 24

2.2.2 The Class of Functions to be Approximated . . . . . . . . . . . . . . 25

2.3 MainResult.................................... 27

2.3.1 Special Cases of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . 29

2.3.2 Sharpness of the Bound in Theorem 2 . . . . . . . . . . . . . . . . . 31

2.4 NumericalResults ................................ 32

2.5 ProofofTheorem2................................ 35

2.5.1 Pre-Processing .............................. 35

2.5.2 Post-Processing.............................. 36

2.5.3 TheErrorEvent ............................. 37

2.5.4 PerformanceBounds........................... 38

2.6 Preliminaries on Sub-Gaussian and Sub-Exponential Random Variables . . 43

2.7 ProofofLemmas1and2 ............................ 45

Contents

3 Applications of Distributed Function Approximation to Vertical Federated

Learning 51

3.1 Support Vector Machines with Additive Kernels for Regression and Classi-

fication ...................................... 52

3.2 Model-Agnostic Approach to Over-the-Air-Computed Classifiers . . . . . . 54

3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers . . 57

4 Security in Over-the-Air Computation 67

4.1 PriorWork .................................... 68

4.2 SystemModel................................... 70

4.2.1 Distributed Approximation of Functions . . . . . . . . . . . . . . . . 70

4.2.2 Secrecy Extension to Distributed Approximation of Functions . . . . 71

4.2.3 Special case K=1............................ 73

4.3 Specialization to the Additive White Gaussian Noise Channel . . . . . . . . 74

4.4 MainResults ................................... 77

4.5 Implications of the Main Results . . . . . . . . . . . . . . . . . . . . . . . . 82

4.5.1 Feasibility of Channel Approximation . . . . . . . . . . . . . . . . . 82

4.5.2 Back to the Additive White Gaussian Noise case: Calculating Mean

Square Error Security Guarantees . . . . . . . . . . . . . . . . . . . 85

4.6 Proofs ....................................... 87

4.6.1 Statistical Preliminaries for the Proof of Lemma 12 . . . . . . . . . . 87

4.6.2 ProofofTheorem4 ........................... 89

4.6.3 ProofofTheorem5 ........................... 91

4.6.4 ProofofTheorem6 ........................... 96

4.6.5 Cost Constraint in Compound Channel Coding and Resolvability . . 101

Publication List 107

Bibliography 109

Notations and Symbols 119

Abbreviations 123

Index 125

xii

1 Introduction

In this thesis, we study the approximation of a function fof distributed arguments

s1, . . . , sK. Namely, we consider terminals A1,...,AK,Bwhich are connected by a multiple-

access channel (MAC). The MAC is described as a stochastic kernel Wwhich randomly

maps every possible tuple of inputs (T1, . . . , TK) at A1,...,AKto an output Yat B.

Each terminal Akholds some value skfrom which it generates a sequence of channel in-

puts TM

k= (Tk,1, . . . , Tk,M ). The channel is then invoked Mtimes and generates, for

each input tuple (T1,m, . . . , Tk,m), an output symbol Ym. At terminal B, the sequence

YM= (Y1, . . . , YM) is processed to yield a value f

˜. A distributed function approximation

scheme should have the property f

˜≈f(s1, . . . , sK). The meaning of “≈” can vary, but it

usually means that a suitably defined distance1between f

˜and f(s1, . . . , sK) is small and

approaches 0 as Mtends towards infinity. The channels we consider in this work mostly

have complex or real inputs and outputs, and they are of the form

∑︂

k=1

HkTk+N. (1.1)

In channels of this form, H1, . . . , HKare called the fading coefficients, and Nis called

the (additive) noise. The fading coefficients can in some cases be deterministic, but in

general, both the fading and the noise follow a random distribution. It is this randomness

which makes the channel output Y“noisy” in the sense that in general, it is not possible

to recover exact information about T1, . . . , TKor their sum from the observed value of Y.

1.1 Nomographic Functions

Given the channel (1.1), it is natural to ask which functions can be approximated from

distributed arguments over such channels. In case there is no fading or noise (i.e., N= 0

and H1=··· =HK= 1 almost surely), this turns out to be the class of nomographic

functions which we discuss in this section. The connection between nomographic functions

and distributed function approximation was first observed in [GS13] and has been discussed

1This distance does not need to be a metric.

1 Introduction

and analyzed in [GBS13] in more detail than we can in the following summary. Here and

in the rest of the thesis, Rdenotes the set of real numbers.

Definition 1. Anomographic representation of a function f:RK→Rconsists of

functions f1, . . . , fK, F :R→Rsuch that

∀a1, . . . , aK∈R:f(a1, . . . , aK) = F(︄K

∑︂

k=1

fk(ak))︄.(1.2)

A function f:RK→Rwhich has a nomographic representation is called a nomographic

function.

It has been noted in [Buc76, Theorem 8] that every function is nomographic according

to this definition. We state a version of this result that fits with Definition 1. Since it

illustrates the arguments below very well, we also give a full proof, based on the same idea

as in [Buc76].

Theorem 1. (adapted from [Buc76, Theorem 8]). Every function f:RK→Ris nomo-

graphic.

Proof. We first fix an arbitrary bijection ϕ:R→(0,1). An example of a possible choice

ϕ:a↦→ ⎧

⎪

⎨

⎪

⎩

2·1

a+1 , a ∈(0,∞)

2·(︂1 + 1

|a|+1)︂, a ∈(−∞,0)

2, a = 0.

(1.3)

Next, we define for every a∈(0,1) a sequence of digits ca,1, ca,2,··· ∈ {0,...,9}such

that2

a= 0.

dec

ca,1ca,2. . . := ∞

∑︂

i=1

ca,i ·10−i.(1.4)

We make the choice for the sequence ca,1, ca,2, . . . unique by requiring that it has to

contain infinitely many non-zero elements. Let, for all k∈ {1, . . . , K},

fk(a) := 0.

dec

0. . . 0

⏞⏟⏟⏞

k−1

cϕ(a),10. . . 0

⏞⏟⏟⏞

K−1

cϕ(a),20. . . 0

⏞⏟⏟⏞

K−1

cϕ(a),30. . . 0

⏞⏟⏟⏞

K−1

. . . . (1.5)

2Of course, there is nothing special about base 10 here, and in fact, [Buc76] uses dyadic representations.

We have chosen the base 10 here so that our representation coincides with the usual decimal notation

of numbers.

1.1 Nomographic Functions

Define ψ1, . . . , ψK, F : (0,1) →Rby

ψk: 0.

dec

b1b2. . . ↦→ ϕ−1⎛

⎝0.

dec

bkbk+Kbk+2K. . .⎞

⎠(1.6)

F:a↦→ f(ψ1(a), . . . , ψK(a)) .(1.7)

It is clear from our construction that the maps

(a1, . . . , aK)↦→

∑︂

k=1

fk(ak) (1.8)

a↦→ (ψ1(a), . . . , ψK(a)) (1.9)

are inverses of each other and therefore, (1.2) is satisfied, concluding the proof that fis

nomographic.

In order to use the nomographic representation of a function in a wireless communication

system, the inner functions f1, . . . , fKshould be computed at the transmitter before the

actual transmission, while the outer function Fshould be implemented and evaluated

at the receiver. Therefore, f1, . . . , fKare sometimes referred to as the pre-processing

functions while Fis called a post-processing function. The summation is performed by

the wireless channel due to its superposition property. If the receiver has access to f1(a1)+

···+fK(aK), then from (1.8) and (1.9), it is clear that a full reconstruction of a1, . . . , aK

is possible and in fact, this full reconstruction is used as an intermediate step in post-

processing. However, such an approach does not generalize to noisy channels in an obvious

way. Indeed, in (1.5) we can see that arbitrarily significant3digits of the transmitted values

can be hidden in digits of arbitrarily low significance in the real number that is transmitted

over the channel and therefore, even a channel noise of extremely low power can cause

arbitrarily strong disruptions. Of course, this is due to the specific construction of Fused

in the proof of Theorem 1, but we can expect that in general any strong discontinuity of

Fcan cause problems of this kind when the argument of Fis noisy.

It appears therefore that in order to apply a nomographic representation to a distributed

function approximation problem in the presence of noise, Definition 1 is not strong enough

and we need to impose additional constraints on the functions f1, . . . , fK, F. A famous

result [Arn57, Kol57] states that every continuous function f:RK→Rcan be written

3The significance of a digit in the decimal representation is an indication of how strongly a change of the

digit influences the value of the represented number. For instance, the significance of the digit ca,i in

(1.4) is 10−i, since the digit is multiplied with this number to determine the value of a.

1 Introduction

as a sum of 2K+ 1 functions with continuous nomographic representations,4giving a

positive answer in part to the question posed by Hilbert as the thirteenth problem in

his list of unresolved mathematical problems of the 20th century [Hil00]. If there was a

result implying that for every algebraic function, there is a nomographic representation

consisting only of algebraic functions, this would give a positive answer to the as-of-yet

unresolved part of Hilbert’s thirteenth problem. We can therefore expect that proving

such a result would be very hard.5Another result worth noting in this context is that

the set of functions with a continuous nomographic representation is nowhere dense in

the space of continuous functions [Buc82]. This provides another piece of evidence that

generic nomographic representations suitable for distributed function approximation may

not exist.

1.2 Distributed Function Approximation in Noisy Channels: An

Example

From an information-theoretic viewpoint, channels of the form (1.1) with no fading or noise

have infinite Shannon capacity. Therefore, they are not realistic in the sense that they

cannot be expected to sufficiently accurately model a real-world communication channel.

If we focus on finite-capacity channels, however, we can make some basic information-

theoretic arguments to further motivate the development of schemes in which f(s1, . . . , sK)

is directly approximated at Binstead of transmitting the values s1, . . . , sKto Bseparately

(from which f(s1, . . . , sK) could easily be computed at B).

If all values s1, . . . , sKare made available at B, this means that the number of uses of

the channel is lower bounded by the quotient of the Shannon entropy of s1, . . . , sKand

the sum-rate capacity of the channel. On the other hand, if the entropy of f(s1, . . . , sK) is

significantly smaller than that of s1, . . . , sK, then a scheme which only makes f(s1, . . . , sK)

available at Bis not subject to the same fundamental bound. The following example

illustrates that this difference of entropy can be significant.

Example 1. (from [7]). Suppose that Knodes send their data s1, . . . , snto a single

receiver through a MAC. For simplicity, we assume that each skis an independent random

variable uniformly distributed over S={0,1}. Now if the receiver reconstructs each of

these variables, then the entropy or the amount of information available at the receiver

is ∑︁K

k=1 H(sk) = Klog 2 nats where log(·)is the natural logarithm with Euler’s number

4A continuous nomographic representation is a nomographic representation that consists only of contin-

uous functions f1,...,fK, F.

5Hilbert even hypothesized that the correct answer to the question would be negative [Hil00,Hil27], which

was, however, partly disproven in [Arn57,Kol57].

1.3 Over-the-Air Computation

0 5 10 15 20 25 30

Entropy in nats

H(s1, . . . , sK)

H(∑︁K

k=1 sk)

Figure 1.1: Plot of sum vs. tuple entropy in Example 1.

e≈2.71828 as the basis, H(sk) := −∑︁s′∈S psk(s′) log psk(s′)is the Shannon entropy6

and psk:S ↦→ [0,1] is the probability mass function of sk. This means that the nodes

have to transmit Klog 2 nats (or, equivalently, Kbits) to the receiver. Therefore, if the

capacity of the communication channel is 1bit per channel use, then Kchannel uses

are necessary to convey the full information to the receiver.7Now we assume that the

receiver is only interested in f(s1, . . . , sK) = ∑︁K

k=1 skwhich can be easily computed from

s1, . . . , sK. By the data processing inequality [EGK11, Section 2.3], this operation cannot

increase the amount of information. In fact, the entropy of the function is H(∑︁K

k=1 sk) =

Klog 2 −∑︁K

k=1 (︁K

k)︁2−Klog (︁K

k)︁which is strictly smaller than Klog 2 for all K≥2. This

means that instead of transmitting Kbits that are necessary to reconstruct each sk, the

agents can send significantly less information to the receiver if its objective is to compute

the sum function f(s1, . . . , sK). In Fig. 1.1, it can be seen that this difference is quite

pronounced even for moderately large values of K.

1.3 Over-the-Air Computation

The approximation of functions of distributed arguments over channels of the form (1.1) is

of particular interest in the context of wireless communications. Here, use of the channel

corresponds to a concurrent transmission of waveforms from all transmitters, and the

6We use the convention 0 ·log(1/0) := 0 in the definition.

7In the case of orthogonal channel access, it is necessary to establish Kindependent (interference-free)

communication channels, where each of these has the capacity of 1 bit per channel use.

1 Introduction

receiver observes a noisy, superimposed version of these waveforms. In the context of

this application, distributed function approximation schemes can be seen as instances

of a class of schemes commonly called Computation over MAC (CoMAC), AirComp or

OTA computation. The goal of these schemes is to obtain a scaling behavior of the

communication cost in the number of transmitters that is better than the linear growth8

that would ensue from a separation of source and channel coding. Therefore, such schemes

exhibit the inherent property that the receiver is unable to fully reconstruct all of the

transmitted information.

The idea of a scheme that allows a receiver to reconstruct directly a combined form of

two messages, but not the original messages themselves, can be traced back to [KM79]

where a source coding problem is formulated in which it is the receiver’s task to reconstruct

a sequence of modulo-2 sums of encoded bits. An uncoded analog scheme for obtaining a

noisy estimate of a function of transmitted values with an application to wireless sensor

networks has appeared in [GV03] and is, to the best of our knowledge, the first work that

proposes a joint source-channel approach to OTA computation.

The authors in [GV03] take an analog approach in which a certain amount of noise

is tolerated in the received value and the function is computed only once.9This is in

contrast with a class of digital schemes that are closer to [KM79] in the sense that they

also consider functions with finite domains and typically give error guarantees for a large

number of repeated function computations.

1.3.1 Digital Over-the-Air Computation

In digital OTA computation, the function that is to be computed maps between discrete

sets. The computation is carried out repeatedly, and the objective of the corresponding

coding scheme is that the probability of a decoding error approaches zero as the number

of repetitions tends to infinity.

More formally, [NG07] introduces the problem of digital computation coding in the

following way:

Definition 2. Adigital computation coding problem consists of the following:

•A MAC Wwhich maps channel inputs T1, . . . , TKranging over the input alphabets

T1,...,TKto a channel output Ywhich ranges over the channel output alphabet Y.

8If the expense necessary for coordination and scheduling is also considered, this growth can even be

superlinear.

9The function can be computed multiple times since the scheme can simply be repeated, however, the

individual instances do not take advantage of the repeated computation.

1.3 Over-the-Air Computation

•An objective function

f:S1×···×SK→ S,(1.10)

where S1,...,SK,Sare finite sets.

•A probability distribution on S1×···×SK.

The idea is that, given this problem, the transmitters encode their messages s1, . . . , sK

as sequences of channel inputs in such a way that the receiver can, with high probability

of success, reconstruct f(s1, . . . , sK) without necessarily being able to draw any further

information about s1, . . . , sK.

Definition 3. An (n, M, ε)-code for a given digital computation coding problem consists

of:

•for each k∈ {1, . . . , K}, an encoder

k:Sn

k→ TM

k(1.11)

•adecoder

DM:YM→ Sn(1.12)

such that if the sequence of channel inputs is determined by TM

k:= EM

k(sM

k), the error

probability at the receiver satisfies

P(︂DM(YM)= (f(s(1)

1, . . . , s(1)

K), . . . , f(s(n)

1, . . . , s(n)

K)))︂≤ε. (1.13)

These notions can then be used to define the analog of rate and capacity in classical

source or channel coding problems.

Definition 4. The computation rate of an (n, M, ε)-code is defined as the ratio n/M.

A computation rate Ris called achievable if there is a sequence of (n, M, ε)-codes of

computation rate Rwhere M→ ∞ and ε→0. The computation capacity is the supremum

of all achievable computation rates.

This framework is extended by allowing the alphabets S1,...,SK,Sto be infinite and

then characterizing the rate-distortion trade-off. In any case, the computation coding

problem combines source and channel coding because the encoders simultaneously remove

redundancy from the sources and protect the transmission against channel noise. The

authors of [NG07] note examples where the rate that separate source and channel coding

can achieve is strictly less than the computation capacity.

1 Introduction

In the setting with finite alphabets, the typical objective function considered is addition

in a finite field, and the main application noted by the authors is physical layer net-

work coding. This idea was seminal to a lot of follow-up research (e.g., [ZNGE09, NG11,

OZE+11, NCNC16, GJRM+16]) which has expanded upon and refined the idea of using

OTA computation as a means to increase the efficiency of network coding. Notably, there

is also a work [GBS14] which proposes schemes that use digital computation codes in con-

junction with a quantizer to compute functions that are of interest in other applications,

such as the arithmetic mean, the geometric mean and the Euclidean norm.

1.3.2 Analog Over-the-Air Computation

The framework of digital computation codes is promising and its applications to network

coding are highly relevant as they can realize impressive performance gains in wireless

networks. However, is also has downsides in the context of other applications:

•The notion of computation capacity is an asymptotic one valid only for block lengths

tending to infinity. While finite-blocklength results are certainly conceivable, it is

nonetheless an inherent property of any approach involving digital coding that a

certain number of repeated function computations is necessary in order to guarantee

a reasonably low probability of decoding error. This can be problematic in applica-

tions where only a few computations are necessary or where protocols are used in

which the roles of transmitters and receivers change frequently with only very few

computations being done between these changes.

•To the best of our knowledge, the only known digital coding schemes which can deal

with channel fading compute sums over finite fields for the application of network

coding. Examples of functions that existing digital schemes cannot compute over

fading channels include weighted sums which have a high relevance in the context

of OTA ML, as well as maxima and various kinds of averages which are important

in the context of consensus algorithms and control systems.

•The digital coding schemes can only deal with discrete messages. If real (or floating

point) numbers are processed in a certain application, a quantizer needs to be added

to the system. Since quantization is a form of source coding, this is somewhat in

contrast with the observation that joint source-channel approaches are necessary to

achieve optimum system performance.

A way to make OTA computation applicable where these disadvantages hinder the use

of digital schemes is to process analog input values directly into an electromagnetic signal

without first going through a sequence of bits (or other discrete values) as an intermediary

1.3 Over-the-Air Computation

step. A striking observation in this context is that a standard wireless channel actually

performs a summation of the transmitted signals (which, through their inphase-quadrature

(IQ) representations, can be seen as points in Euclidean space). This opens the door to the

computation both of weighted sums and (as a special case) arithmetic averages, which we

have noted above are very relevant functions both for OTA ML and consensus algorithms.

There are two important research questions that these observations directly raise:

•If we were able to compute real function values in an analog system without error, this

would in the point-to-point case degrade to a possibility to losslessly transmit a real

number through the wireless channel which would imply infinite Shannon capacity

of the channel. Since this is known to be unrealistic for any real-world channel, we

can immediately conclude that a certain amount of noise in the computed function

values is unavoidable in any kind of analog OTA computation scheme. But is it

possible to control the strength of the noise, for instance by providing tail bounds

for its magnitude?

•We can expect from the structure of the wireless channel that it can compute sums

in Euclidean space, but can we, with the use of suitable pre- and post-processing

schemes, compute a larger class of functions OTA?

A pragmatic way to proceed in light of these difficulties is to attempt to find a subclass of

functions that is small enough to permit nomographic representations which are suitable

for use with noisy communication systems and at the same time large enough to contain

most functions of interest in practical OTA computation problems.

There are several prior works that propose approaches to the OTA computation prob-

lem for functions particularly relevant to applications for consensus problems in wireless

networks and ML over wireless channels. [GBS13] systematically explores the question

of what types of functions can efficiently be OTA computed with analog schemes, also

taking into account many system-level aspects such as the usage of analog OTA com-

putation schemes in large wireless networks with changing topologies. [GS13] presents a

scheme that is able to deal with imperfect synchronization and the presence of fading in

OTA computation; extensive theoretical analyses for the asymptotic case is provided for

the arithmetic and geometric mean functions. [GS14] presents pre- and post-processing

schemes and an asymptotic analysis for the approximation of functions over a fast fading

channel with noise in the case of multiple receive antennas. The work covers the case of no

instantaneous channel state information (CSI) at the transmitter or receiver as well as two

different types of instantaneous CSI at the transmitter. In [RGS16], under the assumption

of known fading coefficients at the transmitter, a similar scheme is used for computing the

sign of a weighted sum which is the decision function of a linear Support Vector Machine

1 Introduction

(SVM) used for classification. As a result, the authors obtain a distributed binary classi-

fication scheme that is highly efficient in massively-sized wireless networks. In the more

recent work [LZLV20], under the assumption that the sources are independent and the

channel state is known at both the receiver and the transmitter, the authors derive analog

OTA computation schemes for sums that are optimal in terms of mean square error. In

the case of independent and identically distributed (i.i.d.) Gaussian sources the authors

of [DSD20] show how to OTA compute sums over slow fading channels where the channel

state information is available neither at the transmitter nor the receiver. The work also

considers intersymbol interference and provides an asymptotic theoretical analysis as well

as numerical results.

1.4 Applications of Over-the-Air Computation

OTA computation has potential applications in every setting in which such a large num-

ber of wireless devices share constrained wireless resources that it becomes inefficient or

even infeasible to exclusively use traditional scheduling and separate decoding of all trans-

mitted information before it is post-processed at the receiver. Furthermore, even if the

available resources are tremendous, but the number of participating devices is so large

that traditional scheduling becomes prohibitively expensive, OTA computation can be a

useful tool to solve the problem. On the other hand, it inherently fuses concepts that have

traditionally been separate in communication systems. We have already discussed the

point that from an information theoretic perspective, it is a joint source-channel approach

that breaks with the traditional separation paradigm. But also from the perspective of

network architecture, it means using schemes on the physical layer that are at least in part

tailored to specific applications, and traditional methods of scheduling and routing have

to be adapted to be compatible. Therefore, OTA computation can be seen as a cross-layer

approach that encompasses the entire network stack from the application layer all the way

down to the physical layer. While the pre- and post-processing schemes can be proposed

in such a generic manner that they can in principle be used for a large variety of potential

applications, they still need to be carefully adapted to each one. There are two main

fields of application that have recently motivated the development of OTA computation

schemes, namely distributed OTA ML and consensus algorithms. In this section, we give

a brief overview of these two applications.

1.4.1 Distributed Machine Learning

In this subsection, we take a look at distributed ML, in particular Federated Learning (FL),

describe how this field branches into VFL and Horizontal Federated Learning (HFL) and

1.4 Applications of Over-the-Air Computation

cite a few examples from the literature that approach FL problems with OTA computation

methods. First, we need to define what ML is for the sake of this thesis, and we follow

the formalism in [SC08].

Definition 5. Astatistical inference problem is a tuple (X,Y,P,L), where

•the feature alphabet Xis a Polish space (usually a high-dimensional Euclidean

space),

•Y⊆Ris called the label alphabet,

•Pis a probability measure on X×Y,

•L:X×Y×R→[0,∞)is called the loss function.

In the usual application setting, only the feature and label alphabets and the loss func-

tion are known about the statistical inference problem, while information about Pis only

known indirectly through a training sample.

Definition 6. Given a statistical inference problem (X,Y,P,L), a training sample of

length nis a sequence (xj, yj)n

j=1 ∈Xn×Ynwhere each (xj, yj)is drawn i.i.d. according

to P.

The objective in solving an ML problem is to find an ML model which can make predic-

tions about the labels of newly drawn samples of P, given only the features. An ML model

is a mathematical object which provides, given a set of parameters, a labeling function.

Examples of ML models are neural networks, SVMs and decision trees.

Definition 7. Given a statistical inference problem (X,Y,P,L), a labeling function is

a function f:X→R. A labeling function induces a risk (sometimes also called loss)

RL,P:= EPL(X, Y, f(X)), where (X, Y )is the pair of random variables ranging over

X×Yand distributed according to P.

Typically, the objective is to exploit the indirect knowledge that we have about P

through the training sample to obtain a labeling function with low risk, which is usually

the measure for how well we have solved the ML problem. To this end, a training procedure

for a given ML model takes a training sample as its input and outputs parameters for the

ML model. Therefore, in conjunction with the model, it maps training samples to labeling

functions.

Distributed ML studies cases of ML problems where some of the information about the

statistical inference problem or the training sample are only known at certain locations in

a network. Although there are possibilities for communication between the agents in the

1 Introduction

network, there are application-specific reasons for not transmitting the entire information

to a central point. One particular instance of Distributed ML is called FL [KMRR16,

MMR+17]. In FL, the initial main reason for not transmitting all the available information

to a central point and then solving the ML problem in the traditional way is to preserve

the privacy of the users from whom the training data is collected,10 but communication

efficiency also plays an increasingly important role. FL can be further categorized into

HFL and VFL [YLCT19].

In HFL, each agent kout of a total of Kagents in the system sees only a subsequence

of the training sample (xjk,i , yjk,i )nk

i=1. In principle, it is possible for each agent to train

its own local ML model based on the locally available training sample. Depending on the

application at hand, however, this can incur several difficulties:

•The locally available training subsamples may simply be too small to train an ML

model and obtain an acceptable risk.

•The way in which the locally available training subsamples are drawn from the

overall training sample may be such that the subsamples are not i.i.d. or do not

follow P[ZLL+18]. For instance, it is common for the subsamples to be biased

towards certain labels in a way the overall training sample is not.

Distributed optimization algorithms can be used to carry out the training in a decentral-

ized manner. They make, either at one central point or everywhere in the network, a

trained ML model available that benefits from the whole training sample without trans-

mitting it through the network in its entirety. There is a huge body of recent research

(cf., e.g., [AG20b, ZWH20, ASK19, YJSD20, AG19, GdKS+19, ZDHL20, OUG19, AG20a,

ZLD+20, ZCL+19, AOGE20, SC20, SZG20, STL20, ADG19, CT20] and references therein)

into ways to perform distributed optimization algorithms such as stochastic gradient de-

scent exploiting OTA computation. This approach can achieve fundamentally more favor-

able scaling laws than would be possible otherwise.

In VFL, the data is distributed in a different way: In a system with Kagents, the

statistical inference problem has a feature alphabet X=X1×···×XKthat is a Cartesian

product of Kfeature spaces. A feature x∈Xcan therefore be written as a tuple x=

(x1, . . . , xK) and the training sample is of the form ((x1,j, . . . , xK,j), yj)n

j=1 where each

agent khas only the local training sample (xk,j, yj)n

j=1. Correspondingly, when training

is complete and a label needs to be estimated, each agent ksees only the projection

to Xkof the observed feature. Since the labeling function has the whole feature space

X=X1×···×XKas its domain, the arguments to compute it are not available at any

10A major motivation for introducing the FL framework was Gboard, a software made by Google which

is used as the default keyboard on many Android devices [MR17].

1.4 Applications of Over-the-Air Computation

single point in the network and it is therefore natural to attempt to compute the labeling

function OTA. So there are two important research questions in OTA-VFL:

•Given a training sample that is distributed as described above, how can we carry

out a distributed training procedure exploiting OTA computation that scales better

than linear in the number of agents involved?

•Given the trained model (which also is available only in a distributed manner), how

can we compute the labeling function using the OTA approach?

The first question is quite similar to the main research question in OTA-HFL and there

is some hope that tools from this field could be suitably adapted. The second question is

more specific to the VFL scenario, and we note that many standard ML labeling functions

naturally take the form of (weighted) sums. Examples are layers of neural networks (the

activation function can be evaluated afterwards in post-processing if necessary) and the

linear SVMs that have been used for OTA-VFL in [RGS16]. Contrary to OTA-HFL, there

does not appear to be a large body of research on OTA-VFL. Besides [RGS16] and the

work presented in Chapter 3, we are not aware of any works that propose to leverage

OTA computation in a VFL scenario. However, there is a related research area called

Type-Based Multiple-Access [MT06, MNT07] which is concerned with solving problems

very similar to those approached by OTA VFL such as anomaly detection in extremely

large networks. An important difference is that instead of transmitting analog values

directly, this approach relies on a prior quantization step and then exploits the fact that

the number of quantization levels usually does not grow with the number of transmitters

in the system. Furthermore, this approach uses statistical methods and knowledge about

the involved probability distributions, while the VFL approach that we use in Chapter 3 is

based on the ML paradigm and hence does not require a priori knowledge of the underlying

distributions.

1.4.2 Consensus over Wireless Channels

Consensus problems deal with combining opinions of participating agents to achieve an

agreement that encompasses their information about or subjective assessments of an ob-

ject. They have originally appeared as statistical problems in which the opinions are

probability distributions which have to be combined to form a consensus distribution.

In [EG59] this is illustrated as a horse race betting problem where the agents’ opinions are

probability distributions on which horse will win the race. They place their bets according

to these opinions and the overall track’s odds that result from these bets are considered

the consensus which in a certain way combines all the participating agents’ opinions. The

1 Introduction

problem has subsequently been stated as one of combining various experts’ opinions and

researched extensively to aid with decision making in the context of management sciences

(see, e.g., [Win68,Fre85] and references therein).

The research on this theory has later been applied to problems of multisensor fusion and

pattern recognition [BS92] and since found a multitude of other applications in engineering

sciences [OSFM07]. In some of the engineering applications the nature of the difficulty of

the problem has shifted significantly: Often, an opinion is simply a real number or vector

and the way the opinions have to be combined to form the consensus is fully prescribed

by the application at hand and is fairly simple compared to the original consensus prob-

lem. For instance, the consensus can be the arithmetic average (with applications, e.g.,

in formation control and flocking of autonomous vehicles [OS06]) or the maximum of the

opinions (examples for applications include task assignment [BCH08] and traffic automa-

tion [MDR19]). In these applications, the challenge is that it is infeasible to aggregate

the opinions in a central point because the communication cost or the time delay incurred

would be prohibitive. In these cases, distributed consensus algorithms are used that seek

to make the consensus value available to agents in a large network with a minimum of

communication required between the agents [OSFM07].

In many applications, the communication links between the agents are wireless channels,

and indeed, several agents can be linked to another agent via a wireless broadcast or

multiple-access channel. Some works that exploit these properties to reach average or

maximum consensus in a way that is more communication-efficient than would be possible

with point-to-point communication are [ICJ12,MSR18,MDR19,MASR21]. We expect that

theoretical analysis of OTA computation techniques could serve as a building block to

enhance the efficiency and in particular the scaling behavior of the communication cost

in the number of participating agents. Moreover, this way it would be possible to provide

additional theoretical error guarantees for consensus schemes that exploit the superposition

of signals in the wireless channel. In [9], we have proposed a maximum consensus scheme

which leverages analog OTA computation of sums to make the maximum of the agents’

opinions available at the receiver in a wireless MAC with no fading but with additive

noise. The OTA computation schemes proposed in this thesis can be used to extend

these results to channels exhibiting fast fading [2]. It is in particular worth noting that

the scheme proposed in [9] can OTA compute the maximum of the agents’ opinions in a

wireless channel although we do not expect the maximum function to satisfy Definition 10.

This is achieved not through a single OTA computation but through a multi-step protocol

that alternates between analog OTA computation of sums and digitally coded broadcast

communication. We believe therefore that such multi-step protocols are a potentially

promising approach to computing functions that are not amenable to the one-shot OTA

1.5 Contribution and Outline

computation methods we propose in this thesis. This is at the cost of higher system and

communication complexity, but a favorable scaling of communication cost in extremely

large networks would be retained.

1.5 Contribution and Outline

This thesis proposes schemes for the approximation of functions of distributed arguments

and deals with the question how OTA computation can be made robust under harsh

channel conditions. We also give examples of applications to distributed ML and undertake

first steps towards hardening such schemes against eavesdroppers. Although we provide

results of numerical simulations for OTA computation under harsh channel conditions

and its applications to distributed ML, the main focus of this thesis is on mathematically

proving theoretical guarantees. In this sense, our work is complementary to the many

predominantly empirical studies in the literature which investigate OTA computation and

its applications in specific real-world scenarios through numerical simulations.

The material contained in this introduction is based on the book chapter [7] with the ex-

ception of this subsection which uses excerpts from the publications the respective chapters

are based on.

In Chapter 2, we propose a distributed function approximation scheme which can be

applied to wireless channels to perform analog OTA computation without instantaneous

CSI. It is not necessary to assume any probability distribution on the data. Therefore,

the scheme works equally well for independently distributed data as it does for arbitrarily

correlated values. Our error analysis relies on mathematical tools from high-dimensional

statistics which we adapt so that they can be applied to the scenario at hand. The

resulting error guarantees are nonasymptotic and are valid for any fading and noise in

the class of sub-Gaussian distributions. This class contains Gaussian as well as many

non-Gaussian distributions of practical interest. Moreover, our scheme can deal with

correlated fading and noise and compute a larger class of functions than previous works

with theoretically proven bounds. In particular, we do not require linearity of the function

to be approximated, which is, e.g., demonstrated by the fact that we can compute p-norms

OTA. We conclude the chapter with numerical evaluations for a few selected cases.

The material in Chapter 2 is based on the conference publications [2,3] and the journal

publication [4]. The introductory remarks are in part based on the book chapter [7].

In Chapter 3, we propose applications of our scheme to the OTA computation of both

regressors and classifiers in VFL and validate the proposed OTA computation schemes

and the envisioned applications in ML with extensive numerical simulations for the case

of a binary classification problem.

1 Introduction

The material in Chapter 3 is based on the conference publications [2,3] and the journal

publication [4].

In Chapter 4, we propose a novel framework and result for incorporating security

considerations by including a friendly jammer in the system which deteriorates the eaves-

dropper’s SNR while not impacting the legitimate receiver’s ability to obtain an approxi-

mation of the function value which is to be OTA computed. As an example, we show how

this jamming strategy can be applied to an AWGN channel where the OTA computed

function is an arithmetic average. We prove that the security guarantee translates to a

lower bound on the mean square error of the eavesdropper’s function estimate. Our proofs

heavily rely on two information-theoretic results which we also prove in this chapter. The

first information-theoretic ingredient is a theorem on compound channel coding for con-

tinuous alphabets. It is a generalization of the result of the part of [RV68] which considers

finite Gaussian channels and we can consequently recover this result as a special case.

The second information-theoretic ingredient is an achievability theorem on resolvability of

channels with continuous alphabets. Note that the publication [1] on which this part of

the chapter is based contains also a converse result and a second-order direct result, both

of which are not part of this thesis.

The material in Chapter 4 is based on the conference publications [1,5], and on the sub-

mitted journal paper [6]. A revised version of the latter paper is currently under consider-

ation for publication. The introductory remarks are in part based on the book chapter [7].

Further Work. During my time at Technische Universit¨at Berlin as a PhD student, we

were able to obtain further results which are, however, not part of this thesis. They are

listed in chronological order.

•In [8], we revisit a secrecy proof for the MAC from the perspective of channel resolv-

ability and refine the approach to obtain novel results on the second-order achievable

rates.

•In a work with Navneet Agrawal [9], we propose a multi-step communication proto-

col based on the idea of OTA computation. The resulting scheme is able to compute

the maximum of a set of distributed values in a wireless network. While the com-

munication cost and complexity are greater than in the one-shot method studied

in this thesis, they exhibit a similarly favorable scaling behavior as the number of

transmitters in the system grows. Computation of the maximum function is relevant

in many distributed control systems and moreover, this example shows that OTA

computation can potentially be applied even to functions that are not contained in

the class of functions studied in this thesis.

1.5 Contribution and Outline

•In a series of works with Zoran Utkovski, Patrick Agostini, Miguel Gutierrez-Estevez

and Daniel Sch¨aufele [10–12], we investigate how physical layer security methods

could be integrated in future cellular networks. The security of this class of commu-

nication schemes is always based on the assumption that the channel of the legiti-

mate receiver is stronger than that of the eavesdropper. As a way of ensuring that

this assumption holds true in a real-world wireless network, we propose the concept

of secrecy maps. They are based on radio maps and help the infrastructure and

legitimate users of wireless networks to predict physical locations at which secure

communication is possible as well as give guidance on how conditions in other loca-

tions can be improved, for instance with the use of intelligent reflective surfaces or

friendly jamming.

•In [13], we propose a new proof method for direct coding theorems for wiretap

channels where the eavesdropper has access to a quantum version of the transmitted

signal on an infinite dimensional Hilbert space. The main part of this proof is a

direct coding result on channel resolvability which states that there is only a doubly

exponentially small probability that a standard random codebook does not solve the

channel resolvability problem for the classical-quantum channel.

2 Distributed Function Approximation in

Fading and Correlated Channels

In this section, we discuss our Distributed Function Approximation (DFA) scheme which

we proposed in [2] and extended in [3, 4]. The goal in introducing it was to provide a

flexible framework that can deal with such a large class of wireless channels that the

scheme would be robust to departures from common assumptions on the system model

such as Gaussianity of the fading and noise. At the same time, the class of functions for

OTA computation should contain the most relevant ones in current applications (which

are mainly weighted sums). It should also be large enough to provide flexibility and

make the DFA scheme applicable in scenarios where functions that have not yet received

much attention are computed OTA. Another important consideration in the design of the

scheme was the distribution of the sources. Many existing works on OTA computation

assume a particular source distribution for their theoretical analysis, and usually require

that the transmitted values are independently distributed between the transmitters. Since

this requirement is extremely difficult to check in practice, we have decided to not model

the sources stochastically. Instead, we show that the bound on the approximation error

is satisfied uniformly over all possible values of the sources. This yields a worst-case

analysis with theoretically proven error guarantees that are valid for every distribution

of the sources, even if there is arbitrary correlation between them. In addition, the error

bounds are nonasymptotic in the sense that they are valid for any number of channel uses,

not just for a sufficiently large one.

In theoretical works, it is of particular importance that the considered channel model is

sufficiently general so that the assumptions made are met in relevant practical scenarios.

The commonly considered Gaussian fading channel is an approximation that is often

adopted because it is relatively easy to treat and is quite close to reality in many scenarios

of interest.

However, it is well known and has been confirmed in extensive measurement cam-

paigns [MS93, BRB93, Mid99] that there are many natural and artificial sources of noise

that do not conform to the assumption of being i.i.d. Gaussian such as automobile igni-

tion, power line emissions, atmospheric disturbances or interfering wireless communica-

2 Distributed Function Approximation in Fading and Correlated Channels

tions [MS93].

Moreover, common arguments that the fast fading in wireless channels is Gaussian

employ the Central Limit Theorem [YC16, Section 2.4.1] and therefore assume a large

number of multipath components, which is not always the case. In scenarios with limited

mobility, it is also possible that the fast fading realizations are not independent between

channel uses. In the case of the OTA computation scheme proposed in [2], it can deteriorate

performance and therefore, it is desirable to be able to quantify this deterioration.

In this work, we therefore consider a channel model that encompasses a large class of pos-

sible probability distributions of fading and noise, the class of sub-Gaussian distributions.

The analysis provided is valid also for fading and noise exhibiting an arbitrary correla-

tion structure, with practically useful bounds in many relevant cases. In Section 2.1.1,

we define precisely what sub-Gaussian means and in Section 2.1.3, we give examples of

practically relevant cases that are covered by our system model assumptions.

2.1 System Model

2.1.1 Sub-Gaussian Random Variables

We begin with a short overview of the relevant definitions and properties of sub-Gaussian

random variables. More on this topic can be found in Section 2.6 and in [BK00, Wai19,

Ver18].

For a random variable X, we define11

τ(X) := inf {︂t > 0 : ∀λ∈R E exp (λ(X−EX)) ≤exp (︁λ2t2/2)︁}︂.(2.1)

exp(·) denotes the exponential function with Euler’s number e≈2.71828 as the basis.

Xis called a sub-Gaussian random variable if τ(X)<∞. The function τ(·) defines a

semi-norm on the set of sub-Gaussian random variables [BK00, Theorem 1.1.2], i.e., it is

absolutely homogeneous, satisfies the triangle inequality, and is non-negative. τ(X)=0

does not necessarily imply X= 0 unless we identify random variables that are equal almost

everywhere. Examples of sub-Gaussian random variables include Gaussian and bounded

random variables.

2.1 System Model

+ +

YMf

Figure 2.1: System model for DFA.

2.1.2 Channel and Correlation Model

We consider the following channel model with Ktransmitters and one receiver: For m=

1, . . . , M, the channel output at the m-th channel use is given by

Y(m) =

∑︂

k=1

Hk(m)Tk(m) + N(m).(2.2)

Here and hereafter, the notation is defined as follows:

•Tk(m)∈Cis a transmit symbol, where Cdenotes the set of complex numbers. We

assume a peak power constraint |Tk(m)|2≤Pfor k= 1, . . . , K and m= 1, . . . , M.

•Hk(m), k= 1 ...,K,m= 1, . . . , M, are complex-valued random variables such that

for every m= 1, . . . , M and k= 1, . . . , K, the real part HRe

k(m) and the imaginary

part HIm

k(m) of Hk(m) are sub-Gaussian random variables with mean zero and

variance 1.

•N(m), m= 1, . . . , M, are complex-valued random variables. We assume that the real

and imaginary parts NRe(m), NIm(m) of N(m) are sub-Gaussian random variables

with mean zero for m= 1, . . . , M.

In order to be able to apply a variation of the Hanson-Wright inequality as a tool, we

give the formal description of our dependence model in terms of matrices and vectors with

real entries.

We define

H:= (H(1),...,H(2M))T(2.3)

11Note that other norms on the space of sub-Gaussian random variables that appear in the literature

are equivalent to τ(·) (see, e.g., [BK00]). The particular definition we choose here matters, however,

because we derive results in which no unspecified constants appear.

2 Distributed Function Approximation in Fading and Correlated Channels

where ·Tdenotes the transpose and for m= 1, . . . , M,

H(2m−1) := (HRe

1(m), . . . , HRe

K(m))

H(2m) := (HIm

1(m), . . . , HIm

K(m)).

So His the vector of all fading coefficients. Similarly, let

N:= (NRe(1), NIm(1), . . . , NRe(M), NIm(M))T(2.4)

be the vector of all the instances of additive noise. The dependence model we consider

is such that there is a vector Gof (2KM + 2M) independent random variables with sub-

Gaussian norm at most 1 and matrices A ∈ R2KM×(2KM+2M)and B ∈ R2M×(2KM+2M)

such that

H=AG, N =BG.

Our correlation model captures both correlations between users and in the time domain,

but the impact of these two types of correlation on the performance of the proposed scheme

is different. For this reason, we need the following definition, which describes a scenario

where there can be arbitrary correlation in the time domain but no harmful correlation

between different users. In order to be as unrestrictive as possible, the following definition

prohibits only correlations between different users at the same complex dimension of the

same channel use.

Definition 8. We say that the fading is user-independent if for every k1=k2,j∈

{Re,Im}and m, the random variables Hj

k1(m)and Hj

k2(m)are independent.

In Section 2.3.1, we discuss the impact of deviations from the user-independence as-

sumption on the performance of the proposed method.

Remark 1. User-independent fading can be characterized based on the form of A. That

is, the fading is user-independent iff it can be written as H=AG with

A=⎛

⎜

⎝

A(1)

A(2M)

⎞

⎟

⎠,

where for all m,A(m)∈RK×(2MK+2M)and each A(m)has at most one nonzero entry per

column. This is because H(m) = A(m)Gand therefore at most one nonzero entry in a

column of A(m)means that at most one entry in H(m)depends on the entry in Gto which

2.1 System Model

the column refers.

2.1.3 Discussion of the System Model Assumptions

The most distinguishing feature of our channel model compared to assumptions made

in prior work on OTA computation is that we generalize the distribution of the fading

and noise to be sub-Gaussian and to feature arbitrary correlations. We argue that this

guarantees robustness against departure from standard assumptions such as independent

or Gaussian fading or noise. Regarding the chosen dependence model, we remark that in

the case of Gdistributed i.i.d. standard Gaussian, this amounts to a standard representa-

tion of arbitrarily correlated (and thus arbitrarily interdependent) multivariate Gaussian

vectors. Therefore, by replacing Gwith a vector of independent sub-Gaussian entries, we

obtain a straightforward generalization of the Gaussian case, which specializes to arbitrary

correlations (although in the non-Gaussian case, not to arbitrary stochastic dependence).

In particular, the following important cases are specializations of our channel model:

•Perfect Gaussian fast fading with i.i.d. bivariate Gaussian fading and additive white

Gaussian noise.

•Scenarios where the number of multipath components is not sufficiently large for an

appeal to the Central Limit Theorem to argue that the fading is complex Gaussian.

For instance, if there is only one multipath component with a length that has small

random variations, the resulting complex fading will have a distribution supported

on a narrow annulus in the complex plane. Such a distribution is not Gaussian, but

sub-Gaussian and therefore covered by our channel model.

•Scenarios where due to limited or slow movement of transmitters, receiver and envi-

ronment, the independence assumption between subsequent realizations of the chan-

nel fading is not satisfied or where the diversity in the radio environment is so small

that some of the users have correlated channels. One example of a widely used

channel model that is a special case of the one considered in this paper is the block

fading channel. In this case, we have a correlation of 1 between fading realizations

of the same block and 0 between fading realizations in different blocks.

•Any of the above scenarios where in addition to thermal additive noise, we have

interference from users outside the system. E.g., in the case of digital modulated

signals, such interference consists of a sequence of transmitted constellation points,

which is not Gaussian. Also, such interfering signals are inherently bursty in nature

and therefore cannot be argued to be independent between different points in time.

2 Distributed Function Approximation in Fading and Correlated Channels

However, signals from realistic transmitters are always bounded in amplitude and

therefore sub-Gaussian, which means that they are covered by our system model.

•Any of various types of artificial and natural interference that is not necessarily

Gaussian, but limited in power. [MS93,BRB93,Mid99] investigate various sources of

non-Gaussian interference of this type, both correlated and uncorrelated over time,

through theoretical modeling as well as extensive experiments and measurements

confirming the theoretical models and the non-Gaussianity of the sources. [MS93,

Table 2.1] enumerates several examples of this type of interference, including, e.g.,

solar radiation, automobile ignition and power line EM emissions.

The other main difference between our system model and the models used in existing

works is that we do not make any assumption on a distribution of the input data at the

transmitters. This means in particular that we cover arbitrary dependencies in the input

data, which can be very important for applications, because the input data can depend,

e.g., on local sensor readings recorded by devices or on training data collected for the same

ML problem. Therefore, in many relevant application scenarios, it is important that the

transmission schemes employed are robust even to high, but unknown levels of correlations

in the transmitted data.

2.2 Problem Statement

2.2.1 Distributed Approximation of Functions

Our goal is to approximate functions f:S1×. . . × SK→Rin a distributed setting.

The sets S1, . . . SK⊆Rare assumed to be closed and endowed with their natural Borel

σ-algebras, and we consider the product σ-algebra on the set S1×. . . ×SK. Furthermore,

the functions f:S1×. . . ×SK→Runder consideration are assumed to be measurable in

what follows.

Definition 9. An admissible DFA scheme for f:S1×. . . × SK→Rwith Mchannel

uses, depicted in Fig. 2.1, is a pair (EM, DM), consisting of:

1. A pre-processing function EM= (EM

1, . . . , EM

K), where each EM

kis of the form

k(sk) = (Ek(m, sk, Uk(m)))M

m=1 ∈CM

where Uk(1), . . . , Uk(M)are random variables and the map

(sk, u1, . . . , uM)↦→ (Ek(m, sk, um))M

m=1 ∈CM

2.2 Problem Statement

is measurable for skranging over Skand u1, . . . , uMranging over the same sets as

the i.i.d. random variables Uk(1), . . . , Uk(M). The encoder EM

kis subject to the peak

power constraint |Ek(m, sk, Uk(m))|2≤Pfor all k= 1, . . . , K and m= 1, . . . , M.

2. A post-processing function DM: The receiver is allowed to apply a measurable recov-

ery function DM:CM→Rupon observing the output of the channel.

So in order to approximate f, the transmitters apply their pre-processing maps to

(s1, . . . , sK)∈ S1×. . . × SKresulting in EM

1(s1), . . . , EM

K(sK), which are sent to the

receiver using the channel Mtimes. The receiver observes the output of the channel and

applies the recovery map DM. The whole process defines an estimate f

˜of f.

Let ε > 0, δ ∈(0,1) and f:S1×. . .×SK→Rbe given. We say that fis ε-approximated

after Mchannel uses with confidence level δif there is a DFA scheme (EM, DM) such

that the resulting estimate f

˜of fsatisfies

P(|f

˜(sK)−f(sK)| ≥ ε)≤δ(2.5)

for all sK:= (s1, . . . , sK)∈ S1×. . . ×SK. Let M(f, ε, δ) denote the smallest number of

channel uses such that there is an approximation scheme (EM, DM) for fsatisfying (2.5).

We call M(f, ε, δ) the communication cost for approximating a function fwith accuracy

εand confidence δ.

2.2.2 The Class of Functions to be Approximated

A measurable function f:S1×. . .×SK→Ris called a generalized linear function if there

are bounded measurable functions (fk)k∈{1,...,K}, with f(s1, . . . , sK) = ∑︁K

k=1 fk(sk),for all

(s1, . . . , sK)∈ S1×. . .×SK. The set of generalized linear functions from S1×. . .×SK→R

is denoted by FK,lin. Our main object of interest will be the following class of functions.

Definition 10. A measurable function f:S1×. . . ×SK→Ris said to belong to Fmon if

there exist bounded and measurable inner functions functions (fk)k∈{1,...,K}, a measurable

set A⊆Rwith the property f1(S1) + . . . +fK(SK)⊆A, a measurable outer function

F:A→Rsuch that for all (s1, . . . , sK)∈ S1×. . . ×SKwe have

f(s1, . . . , sK) = F(︄K

∑︂

k=1

fk(sk))︄,(2.6)

and there is a strictly increasing measurable function Φ : [0,∞)→[0,∞)with Φ(0) = 0

and

|F(a1)−F(a2)| ≤ Φ(|a1−a2|) (2.7)

2 Distributed Function Approximation in Fading and Correlated Channels

for all a1, a2∈A. We call the function Φan increment majorant of f.

Some examples of functions in Fmon are:

1. Obviously, all f∈ FK,lin belong to Fmon.

2. For any f∈ FK,lin and B-Lipschitz function F:R→Rwe have F◦f∈ Fmon with

Φ : [0,∞)→[0,∞), a↦→ Ba.

3. If f∈ FK,lin and Fis (C, α)-H¨older continuous, i.e., for all a1, a2∈A,|F(a1)−F(a2)| ≤

C|a1−a2|α,then F◦f∈ Fmon with Φ : a↦→ Caα.

4. For any p≥1 and S1,...,SKcompact, || · ||p∈ Fmon. In this example we have

fk(sk) = |sk|p,k= 1, . . . , K,F: [0,∞)→[0,∞), a↦→ a1/p, and F= Φ.

This can be seen as follows. We have to show that for all nonnegative a1, a2∈R

and p≥1 we have

|a1/p

1−a1/p

2|≤|a1−a2|1/p.(2.8)

We can assume w.l.o.g. that a1< a2holds. Then since

|a1/p

1−a1/p

2|=|a2|1/p (︄1−(︃a1

a2)︃1/p)︄

it suffices to prove that for all a∈[0,1] and p≥1 we have 1 −a1/p ≤(1 −a)1/p.

Now since a1/p +(1 −a)1/p ≥a+(1−a) = 1 for a∈[0,1] and p≥1, we can conclude

that (2.8) holds.

Given a function f∈ Fmon, we implicitly fix a representation (2.6) and define the total

spread of the inner part of f∈ Fmon as

∆

¯(f) :=

∑︂

k=1

(ϕmax,k −ϕmin,k),(2.9)

along with the max-spread

∆(f) := max

1≤k≤K(ϕmax,k −ϕmin,k),(2.10)

where

ϕmin,k := inf

s∈Sk

fk(s), ϕmax,k := sup

s∈Sk

fk(s).(2.11)

We define the relative spread of fwith power constraint Pas

∆(f∥P) := P·∆

¯(f)

∆(f).(2.12)

2.3 Main Result

We are now in a position to state our main theorem on approximation of functions in

Fmon. We use ∥·∥op and ∥·∥Fto denote the operator and Frobenius norm of matrices,

respectively.

Theorem 2. Let f∈ Fmon,M∈N, and the power constraint P∈(0,∞)be given. Let Φ

be an increment majorant of f. Assume the fading and noise are correlated as determined

by matrices Aand B. Let Ai∈R2MK×(2MK+2M)be a matrix in the form of Remark 1

which generates user-independent fading that approximates Ain the sense that

∥(A+Ai)(A−Ai)T∥op ≤η.

Then there exist pre- and post-processing operations such that

P(︁f

¯−f(s1, . . . , sK)≥ε)︁

≤2 exp (︃−MΦ−1(ε)2

16F+D+ 4Φ−1(ε)L)︃+ 2 exp (︃−MΦ−1(ε)2

256F+ 32Φ−1(ε)L)︃,(2.13)

where

L=(︄√︂∆

¯(f)∥A∥op +√︄∆(f)

P∥B∥op)︄2

F=L(︄√︃∆

¯(f)

M∥A∥F+√︄∆(f)

PM∥B∥F)︄2

D=(︃4√2M∆

¯(f)η+ 4 ∆(f)

√PM∥ABT∥F)︃2

In the following, we sketch how the pre- and post-processing schemes of Theorem 2

work. For a full formal definition, we refer the reader to the proof in Section 2.5. The

pre- and post-processing schemes are similar (although not identical) to the ones that

have appeared in [GS14,RGS16]. They are based on the same idea of combining random

phase shifts at the transmitter and averaging at the receiver to mitigate the impact of the

unknown fading and noise.

We consider fast fading (which can have any sub-Gaussian distribution) and assume

that CSI is available neither at the transmitter nor at the receiver. One example is i.i.d.

complex standard normal fading, which has a uniformly distributed phase. In this case,

since there is no CSI, the phase of the received signal cannot carry any useful information.

On the other hand, the phase difference between the signals from different transmitters

2 Distributed Function Approximation in Fading and Correlated Channels

plays a crucial role, since it determines whether the signals constructively or destructively

overlap at the receiver. We mitigate the impact of a destructive superposition by applying

a random phase shift at the transmitters and averaging the transmission over multiple

channel uses. This has the added benefit of averaging out additive noise.

In summary, the pre-processing is performed by applying the following steps at each

transmitter k(see Section 2.5.1):

•Apply the inner function fkfrom the nomographic representation (2.6).

•Shift and rescale to satisfy the power constraint.

•Apply a random phase shift Uk(m) that is independent for each channel use m.

One option for the random phase shift Uk(m) is to draw it uniformly from the complex

unit circle, but as we argue in the proofs, it is actually sufficient to draw it uniformly from

{−1,1}.

As described above, the phase of the received signal carries no useful information due

to the absence of CSI. Moreover, to compensate for the phase differences between the

transmitters and reduce the influence of additive noise, some form of averaging is required.

We therefore perform the following steps (see Section 2.5.2):

•Compute the total energy of the received signal.

•Subtract the energy of the additive noise over the receive time slot (this is statistical

information and does not require knowledge of the instantaneous noise realizations).

•Invert the rescaling and shift that have been applied during pre-processing.

•Apply the outer function Ffrom the nomographic representation (2.6).

We remark that these pre- and post-processing steps, while specific to the fast fading

scenario, are unmodified compared to the steps used in [2], where we considered only the

case of uncorrelated fading and noise. On the other hand, the error bound of Theorem 2

and the statistical tools used to prove it are different. Moreover, the error bound depends

on the correlation structure of the fading and noise, while the pre- and post-processors do

not need this information.

Remark 2. If no suitable user-independent approximation for Ais available, we can

always choose Ai:= 0, which results in η=∥A∥2

op.

Remark 3. In order to gain more freedom for optimizing the bound for a given cor-

relation structure, it is possible to replace Aiin Theorem 2 with AiAU, where AU∈

R(2MK+2M)×(2MK+2M)is a unitary matrix. This requires only minor adaptations in the

proof of the theorem.

2.3 Main Result

Corollary 1. For the approximation communication cost, we have

M(f, ε, δ)≤log 4 −log δ

Φ−1(ε)2Γ,(2.14)

where

Γ := max (︁16F+D+ 4Φ−1(ε)L,256F+ 32Φ−1(ε)L)︁.

Proof. We upper bound (2.13) as

P(|f

¯−f(sK)| ≥ ε)≤4 exp (︃−MΦ−1(ε)2

Γ)︃,

and solve the expression for Mconcluding the proof.

Remark 4. If Fis B-Lipschitz continuous, we can replace Φ−1(ε)in (2.14) and the

expression for Γwith ε/B.

2.3.1 Special Cases of Theorem 2

In this subsection we discuss the bound of Theorem 2 and illustrate it with two examples.

In order to be a useful bound, (2.13) should approach 0 as M→ ∞. Clearly, it does so

exponentially whenever L,Fand Dare bounded for M→ ∞.

For D, we observe that ABTis 0 if the fading is independent of the additive noise (which

we usually expect to be the case in practically relevant scenarios) and that ηis 0 in the

case of user-independent fading in the sense of Definition 8, which means that the fading

of any one user is independent of the fading of the other users (arbitrary correlations in

the time domain are still allowed). Since Dgrows proportionally with Mη2, we can see

that our bound is not useful for the case of strong correlations between users. Therefore,

in the presence of user-correlated fading, the usefulness of the bound depends on the

scaling behavior of Mη2. When this term exhibits sublinear growth, the error bound of

Theorem 2 approaches 0 as M→0, and if it is additionally upper bounded, the error

bound does so exponentially. In this sense, the bound is robust to small deviations from

the assumption of user-independence. However, it is important to note that even the

user-independent case covers relevant cases of correlation in the time domain such as the

block fading channel. In the expression of F, we can see that the Frobenius norm of both

Aand Bshould not grow faster than √Mand finally, in the expression of L, we see that

the operator norms of Aand Bshould not grow with M. We illustrate that this is the

case in scenarios of interest with the following two examples.

2 Distributed Function Approximation in Fading and Correlated Channels

Corollary 2. In the setting of Theorem 2 with uncorrelated fading and noise, i.e.,

A:= (︂σFid2MK 0)︂,B:= (︂0σNid2M)︂,(2.15)

where idndenotes the n×nidentity matrix, we have

P(︁f

¯−f(s1, . . . , sK)≥ε)︁

≤2 exp (︃−MΦ−1(ε)2

16F′+ 4Φ−1(ε)L′)︃+ 2 exp (︃−MΦ−1(ε)2

256F′+ 32Φ−1(ε)L′)︃,(2.16)

where

L′=(︄√︂∆

¯(f)σF+√︄∆(f)

PσN)︄2

F′=L′(︄√︂2K∆

¯(f)σF+√︄2∆(f)

PσN)︄2

Proof. Note that ABT= 0, ∥A∥op =σF,∥B∥op =σN,∥A∥F=√2MKσFand ∥B∥F=

√2MσN; pick Ai:= Aand substitute this into (2.13).

Corollary 3. In the setting of Theorem 2 where each user has a block fading channel with

block length β, i.e.,

A:= σF

⎛

⎜

⎝

id2K

id2K0

id2K

id2K0

...

0id2K

id2K

⎞

⎟

⎠

⎫

⎪

⎬

⎪

⎭

⎫

⎪

⎬

⎪

⎭

⎫

⎪

⎬

⎪

⎭

B:= (︂0σNid2M,)︂

2.3 Main Result

we have a bound of the form (2.16), where

L′=(︄√︂∆

¯(f)βσF+√︄∆(f)

PσN)︄2

F′=L′(︄√︂2K∆

¯(f)σF+√︄2∆(f)

PσN)︄2

Proof. Note that ABT= 0, ∥A∥op =σF√β,∥B∥op =σN,∥A∥F=√2MKσFand ∥B∥F=

√2MσN; pick Ai:= Aand substitute this into (2.13).

2.3.2 Sharpness of the Bound in Theorem 2

We do not expect the bound (2.13) to be sharp in the sense that there are non-trivial

examples in which it holds with equality. This, we believe, is in part a price that we pay

for using a very general system model, but it is also due to the underlying tools from

high-dimensional statistics that we employ. A further sharpening of this bound could be

an interesting question for future research, but it would hinge on optimizing the bounds

of some of the basic results that we use (such as Lemma 2). In some special cases (such as

uncorrelated Gaussian noise and fading) it is not hard, however, to compute exact bounds,

as can for instance be seen in (2.18) below. In the sequel, we argue that in a sense that

will be made precise, the bound (2.13) is sharp “up to absolute constants”. The example

case for which we show that the bound holds with equality up to constants is uncorrelated

Gaussian fading and noise so that the bound specializes to (2.16). For the purpose of this

section, we focus on the behavior with varying Mand ε, while we consider everything else

constant system parameters.

Theorem 3. In the case of uncorrelated Gaussian fading and noise; i.e., (2.15) is satisfied

and the entries of Gare i.i.d. standard Gaussian, there are constants cand Csuch that

the estimate f

¯obtained by the pre- and post-processing schemes described in Sections 2.5.1

and 2.5.2 satisfies

P(︁f

¯−f(s1, . . . , sK)≥ε)︁≥cexp (︁−CM min(Φ−1(ε),Φ−1(ε)2))︁(2.17)

for suitable choices of Fand Φsuch as F=Φ=id (the identity function).

Note that the upper tail bound (2.16) also has the same form for suitably chosen cand

C, so that we can conclude that it is sharp up to the values of these constants.

Proof. The proof is relatively straightforward, so we only sketch it. Under the assumptions

2 Distributed Function Approximation in Fading and Correlated Channels

made in Theorem 3, we readily compute from the equations in Sections 2.5.1 and 2.5.2

∥Y∥2

σ2

F∑︁K

k=1 ak+σ2

=∥G∥2

Since the entries of Gare i.i.d. standard Gaussian and the vector has 2Mentries, ∥G∥2

clearly follows a chi-square distribution with 2Mdegrees of freedom. We therefore have

in parallel to (2.24)

P(︁f

¯−f(s1, . . . , sK)≥ε)︁≤P(︄∥G∥2

2−E∥G∥2

2≥2PMΦ−1(ε)

∆(f)(σ2

F∑︁K

k=1 ak+σ2

N))︄.(2.18)

The bound here is sharp in the sense that it holds with equality in case F= Φ = id. We

can now use [ZZ20, Corollary 3] to conclude that in this case, (2.17) holds for suitable c

and C.

2.4 Numerical Results

We have simulated the DFA scheme for Rayleigh fading channels with varying noise power,

number of users and amount of channel resources. The simulations were done for two

different functions, with the function arguments in both cases confined to the unit interval

[0,1], to highlight different aspects and properties of the scheme: The arithmetic mean

function is linear and maps only to the interval [0,1] (which means that no scheme can

have an error larger than 1), while the Euclidean norm function maps to [0,√K] and can

show how the DFA scheme deals with nonlinearities.

We compare with a simple Time Division Multiple Access (TDMA) scheme, in which

each user transmits separately in its designated slot, protecting the analog transmission

against channel noise in the same fashion as the DFA scheme, but not sharing the channel

use with other transmitters. In the case where the number of channel uses available is

much larger than the number of users sharing the resources, this form of a TDMA scheme

is of course highly suboptimal, as the transmitters could use source and channel coding to

achieve a higher reliability. However, such an approach is infeasible if the number of users

is so high in comparison to the number of channel uses that only a few or possibly even less

than one channel use is available to each user, and in this work we are mainly interested in

the scaling behavior of our schemes in the number of users K. Therefore, this comparison

provides an insight into the gain achieved by exploiting the superposition properties of

the wireless channel while keeping in mind that for the regime of low K, there are better

coded schemes available. We also remark that the DFA scheme only needs coordination

2.4 Numerical Results

mean norm DFA TDMA

−20 −10 0 10 20 30 40 50

10−5

10−3

10−1

101

noise power in dB

mean squared error

K M

40 103

2560 105

Figure 2.2: Mean squared error of the approximation schemes dependent on the channel

noise power.

between the transmitters insofar as all users need to transmit roughly at the same time,

while a TDMA scheme necessitates an allocation of the channel uses to the individual

transmitters, which can be costly in the case of high K. The simulations carried out in

this section do not consider this scheduling problem and assume for the TDMA scheme

that the time slots have already been allocated, and this knowledge is available at both the

transmitters and the receiver. If M < K, there is not at least one channel use available

to each user and the TDMA scheme can therefore not be carried out. We set the error in

such cases to the maximum of 1 or √K, respectively.

For the simulations, we assume a normalized peak transmitter power constraint of 1 and

channels with fading normalized to a variance of 1 per complex dimension. The power

of the additive noise is given in dB per complex dimension and its negative can therefore

be considered as the signal-to-noise ratio (SNR). Each plotted data point is based on an

average of 1000 simulation runs.

The messages transmitted by the users are generated in the following way: First, we

draw a value µ, which is common to all transmitters, uniformly at random from [0,1].

We then draw the messages of all the users from a convex combination of the uniform

distributions on [0, µ] and [µ, 1] where we choose the weights in such a way that each

2 Distributed Function Approximation in Fading and Correlated Channels

mean norm DFA TDMA

101102103

10−5

10−3

10−1

101

users

mean squared error

noise M

−20 dB 103

20 dB 105

Figure 2.3: Mean squared error of the approximation schemes dependent on the number

of participating transmitters.

mean norm DFA TDMA

0.2 0.4 0.6 0.8 1

·105

10−5

10−3

10−1

101

channel uses

mean squared error

noise/dB K

−20 40

30 2560

Figure 2.4: Mean squared error of the approximation schemes dependent on the number

of channel uses.

2.5 Proof of Theorem 2

message has expectation µ. The reason for choosing this procedure although the DFA

scheme also performs well for more natural distributions such as i.i.d. uniform in [0,1] for

all users is that in case of messages distributed according to a known i.i.d. distribution, the

problem is too easy in the sense that both the mean and the Euclidean norm concentrate

around values that depend only on the distribution and K, and therefore even without

any communication at all, the function value can be quite accurately guessed if Kis large.

On the other hand, we intend the DFA scheme for applications in which the messages can

be correlated and distributed according to unknown distributions, so we opt for this form

of correlation between the messages for the sake of the numerical evaluation.

In Fig. 2.2, we can see that the DFA scheme is at least as good as the TDMA schemes for

all the plotted data points and outperforms it in most cases, achieving a gain of up to 30 dB

for K= 2560. For low powers of the additive noise, the effect of the multiplicative fading

dominates, and therefore, the error saturates as the additive noise grows weaker. Fig. 2.3

illustrates that the DFA scheme performs significantly better if the number of users is not

too low, which is due to the superposition of the signals in the wireless channel resulting

in a combined signal strength that grows with the number of users. We can also see the

TDMA scheme performing similarly to the DFA scheme for low numbers of users, while

quickly deteriorating in performance or even becoming infeasible as their number grows.

In Fig. 2.4, we can observe the exponential decay of the error as the amount of channel

resources used increases. Once again, we can observe that the TDMA scheme performs

similarly to DFA for a low number of users, but becomes infeasible for larger K.

2.5 Proof of Theorem 2

2.5.1 Pre-Processing

In the pre-processing step we encode the function values fk(sk), k= 1, . . . , K as transmit

power:

Tk(m) := √akUk(m),1≤m≤M

with ak=gk(fk(sk)), where gk: [ϕmin,k, ϕmax,k]→[0,P] such that

gk(a) := P

∆(f)(a−ϕmin,k),(2.19)

where ∆(f) is given in (2.10) and ϕmin,k is defined in (2.11).

Uk(m), k= 1, . . . , K,m= 1, . . . , M are i.i.d. with the uniform distribution on {−1,+1}.

We assume the random variables Uk(m), k= 1, . . . , K,m= 1 ...,M, are independent of

Hk(m), k= 1, . . . , K,m= 1, . . . , M, and N(m), m= 1, . . . , M.

2 Distributed Function Approximation in Fading and Correlated Channels

We write the vector of transmitted signals at channel use mas

T(m) := (T1(m), . . . , TK(m))

and combine them in a matrix

Q:=

⎛

⎜

⎝

T(1) 0 0 0 0 . . . 0

0T(1) 0 0 0 . . . 0

0 0 T(2) 0 0 . . . 0

0 0 0 T(2) 0 . . . 0

...

0 0 0 . . . 0T(M) 0

0 0 0 . . . 0 0 T(M)

⎞

⎟

⎠

2.5.2 Post-Processing

The vector Yof received signals across the Mchannel uses can be written as Y=Q·H+N,

where Hand Nare given in (2.3) and (2.4). The post-processing is based on receive energy

which has the form

∥Y∥2

2=YTY= (QAG +BG)T(QAG +BG) = GTWG,

where we use

W:= (QA+B)T(QA+B) = ATQTQA+ATQTB+BTQA+BTB.(2.20)

Equivalently, we can phrase this as

∥Y∥2

∑︂

k=1

ak∥Hk∥2

2+N

¯sK,(2.21)

where Hk= (Hk(1), . . . , Hk(M)) is a vector consisting of complex fading coefficients,

and N

¯sK=∑︁M

m=1 N

¯sK(m) where the overbar notation is used to denote the complex

2.5 Proof of Theorem 2

conjugate. The random variables N

¯sK(m), m= 1, . . . , M, are given by

¯sK(m) :=

∑︂

k,k′=1

k=l

√akak′Hk(m)Hk′(m)Uk(m)Uk′(m)

+ 2 (︄N(m)

∑︂

k=1

√akHk(m)Uk(m))︄Re

+|N(m)|2.(2.22)

The receiver computes its estimate f

¯of f(s1, . . . , sK) as

¯:= F(g

¯(∥Y∥2

2−E∥N∥2

2)),

where

¯(a) := ∆(f)

2·M·Pa+

∑︂

k=1

ϕmin,k.

2.5.3 The Error Event

Clearly, EN

¯sK(m) = E|N(m)|2(since all the other summands in (2.22) are centered). We

can therefore conclude

E(︁g

¯(︁∥Y∥2

2−E∥N∥2

2)︁)︁=g

¯(︁E∥Y∥2

2−E∥N∥2

2)︁=

∑︂

k=1

fk(sk).

We use this to argue

f

¯−f(s1, . . . , sK)=

F(︁g

¯(︁∥Y∥2

2−E∥N∥2

2)︁)︁−F(︄K

∑︂

k=1

fk(sk))︄

≤Φ(︄

¯(︁∥Y∥2

2−E∥N∥2

2)︁−

∑︂

k=1

fk(sk))︄

= Φ (︁g

¯(︁∥Y∥2

2−E∥N∥2

2)︁−g

¯(︁E∥Y∥2

2−E∥N∥2

2)︁)︁

= Φ (︃∆(f)

2MP∥Y∥2

2−E∥Y∥2

2)︃(2.23)

and therefore

P(︁f

¯−f(s1, . . . , sK)≥ε)︁≤P(︃∥Y∥2

2−E∥Y∥2

2≥2MP

∆(f)Φ−1(ε))︃.(2.24)

2 Distributed Function Approximation in Fading and Correlated Channels

2.5.4 Performance Bounds

Our objective is now to establish the concentration of ∥Y∥2

2around its expectation and

thus obtain an upper bound for the right hand side of (2.24). To this end, we first need

to establish a series of lemmas that we will use as tools.

We will split the deviation from the mean into a diagonal and an off-diagonal part. The

first lemma will later help us bound the diagonal part of the error.

Lemma 1. Let V1,...,Vnbe independent random variables and centered with sub-Gaussian

norm at most 1. Let A1,...,Anbe random variables independent of V1,...,Vnbut not

necessarily of each other, and assume that for all k,|Ak| ≤ L

˜and ∑︁n

k=1 A2

k≤F

˜almost

surely. Then we have for any c∈(0,1) and any λ∈(−c/(2L

˜), c/(2L

˜)),

Eexp (︄λ

∑︂

k=1 (︁AkV2

k−E(AkV2

k))︁)︄≤exp (︄λ2

2·8F

1−c)︄Eexp (︄λ

∑︂

k=1 (︁Ak−E(Ak))︁)︄.

The proof of this lemma relies on some other technical results and is therefore relegated

to Section 2.7.

The next lemma is a slight variation of the Hanson-Wright inequality as phrased in [Ver18,

Theorem 6.2.1] and will help us bound the off-diagonal part of the error.

Lemma 2. Let Vbe an Rn-valued random variable with independent, centered entries

and assume that for all k∈ {1, . . . , n}, the k-th entry of Vsatisfies τ(Vk)≤τmax. Let

A ∈ Rn×nwith zeros on the diagonal and ε > 0. Suppose further that ∥A∥op ≤Aop and

∥A∥F≤AF. Then E(︁VTAV)︁= 0 and

P(︁VTAV≥ε)︁≤2 exp (︃−ε2

16τ2

maxεAop + 256τ4

maxA2

F)︃.(2.25)

This lemma differs from [Ver18, Theorem 6.2.1] mainly in that we require the diagonal

entries of Ato be 0 and that all the constants are explicit. The proof follows [Ver18] closely

and is given in Section 2.7. We remark that it is not hard to follow the proof in [Ver18]

further and expand the result to matrices with non-zero diagonal elements, however, this

is not relevant for the present work.

Mainly because the matrix Wcontains randomness, we need a slight modification of

this lemma as well as two more lemmas exploring some specific properties of W.

Corollary 4. Assume a setting as in Lemma 2, but let Abe an Rn×n-valued random

variable independent of Vsuch that almost surely, the diagonal entries of Aare 0,∥A∥op ≤

Aop and ∥A∥F≤AF. Then E(︁VTAV)︁= 0 and (2.25), considering joint expectation,

respectively probability of Vand A, still hold.

2.5 Proof of Theorem 2

Proof. E(︁VTAV)︁= 0 as well as (2.25) hold conditional on any realization of A(except

possibly in a null set) and therefore, the Corollary follows by the laws of total expectation

and total probability.

Lemma 3. We have almost surely

∥W∥F≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂(︂√︁∆(f||P)∥A∥F+∥B∥F)︂,

∥W∥op ≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂2.

Proof. In order to bound the norm of W, we first note that

QQT=

∑︂

k=1

akid2M.(2.26)

Therefore, we can conclude that all singular values of Qare bounded by √︁∆(f||P) and

thus ∥Q∥op ≤√︁∆(f||P).

Noting that ∥AB∥F≤ ∥A∥op∥B∥Ffor all matrices A, B of compatible dimensions and

further noting the submultiplicativity of the operator norm and the triangle inequality of

both norms, we get

∥W∥F≤ ∥QA+B∥op∥QA+B∥F≤(∥Q∥op∥A∥op +∥B∥op) (∥Q∥op∥A∥F+∥B∥F)

≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂(︂√︁∆(f||P)∥A∥F+∥B∥F)︂,

∥W∥op =∥QA+B∥2

op ≤(∥Q∥op∥A∥op +∥B∥op)2≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂2

Lemma 4. We have

τ(trW)≤4M∆(f||P)∥(A+Ai)(A−Ai)T∥op + 2√︁2P∥ABT∥F,(2.27)

where the trace tr(·)is the sum of elements on the diagonal of a square matrix.

Proof. With an addition of zero, we can rewrite

tr (︁ATQTQA)︁= tr (︁ATQTQA)︁+ tr (︁ATQTQAi)︁−tr (︂(Ai)TQTQA)︂

−tr (︂(Ai)TQTQAi)︂+ tr (︂(Ai)TQTQAi)︂

= tr (︂(A−Ai)TQTQ(A+Ai))︂+ tr (︂(Ai)TQTQAi)︂

2 Distributed Function Approximation in Fading and Correlated Channels

and use this together with (2.20) to conclude

trW= tr (︂(A−Ai)TQTQ(A+Ai))︂+ 2tr (︁BTQA)︁+ tr (︂(Ai)TQTQAi)︂+ tr (︁BTB)︁.

(2.28)

Next, we argue that the last two summands are almost surely constant. For tr(BTB) this

is immediately clear. Moreover, we have tr(︂(Ai)TQTQAi)︂=∥QAi∥2

F.We note that as

per Remark 1 and using corresponding notation, we have

QAi=

⎛

⎜

⎝

T(1)A(1)

i0 0 . . . 0

0T(1)A(2)

i0. . . 0

...

0. . . 0T(M)A(2M−1)

0. . . 0 0 T(M)A(2M)

⎞

⎟

⎠

and because each A(m)

ihas only one nonzero entry per column, each entry of QAiis the

product of Uk(m) with a deterministic term for some m, k and therefore, its square can

take only one value almost surely, and consequently, ∥QAi∥2

Falso takes only one value

almost surely.

We can use this in (2.28) and incorporate the triangle inequality to obtain

τ(trW)≤τ(ξ1)+2τ(ξ2),

where

ξ1:= tr (︂(A−Ai)TQTQ(A+Ai))︂

ξ2:= tr (︁BTQA)︁.

To the end of bounding τ(ξ1), we argue

ξ1= tr (︂Q(A+Ai)(A−Ai)TQT)︂≤ ∥(A+Ai)(A−Ai)T∥optr (︁QQT)︁

≤2M∆(f||P)∥(A+Ai)(A−Ai)T∥op.

The first inequality holds because for any square matrix Aand compatible column vector

v, we have

vT(∥A∥opid −A)v=∥v∥2

2(︄∥A∥op −(︃v

∥v∥2)︃T

∥v∥2)︄≥0

2.5 Proof of Theorem 2

(see, e.g., [Bha97, Exercise I.2.10]) and therefore ∥A∥opid−Ais positive semidefinite. The

second inequality directly follows from (2.26). It follows, e.g., from [BK00, Example 1.1.2],

that τ(ξ1) is upper bounded by the first summand on the right hand side in (2.27). Note

that in an analogous way, we can derive the same bound for −ξ1.

In order to bound the sub-Gaussian norm of ξ2, we view it as a function of (Uk(m))K,M

k,m=1

and use part of the proof of the Bounded Differences Inequality [BLM13, Theorem 6.2] to

bound the moment generating function. To this end, we define

(Ei,j)i′,j′=⎧

⎨

⎩

1, i′=iand j′=j

0,otherwise.

and note that a change in the value of Uk(m) changes the value of ξ2by

2√aktr (︁BT(E2m−1,K(2m−2)+k+E2m,K(2m−1)+k)A)︁

= 2√aktr (︁ABT(E2m−1,K(2m−2)+k+E2m,K(2m−1)+k))︁

= 2√ak(︁(ABT)K(2m−2)+k,2m−1+ (ABT)K(2m−1)+k,2m)︁

≤2√︁P(︁(ABT)K(2m−2)+k,2m−1+ (ABT)K(2m−1)+k,2m)︁

Following the proof of the Bounded Differences Inequality [BLM13, Theorem 6.2], we can

now conclude

τ(ξ2)2≤1

∑︂

k=1

∑︂

m=1 (︂2√︁P(ABT)(2m−2)K+k,2m−1+ 2√︁P(ABT)(2m−1)K+k,2m)︂2

≤1

4·2·4·P∥ABT∥2

concluding the proof of the lemma.

Proof of Theorem 2. What remains to be established is the concentration of ∥Y∥2

2around

its expectation. To this end, we observe

P(︁∥Y∥2

2−E∥Y∥2

2≥ε)︁=P(︁GTWG −E(GTWG)≥ε)︁≤P(︂|Σ1| ≥ ε

2)︂+P(︂|Σ2| ≥ ε

2)︂

(2.29)

where

Σ1:=

2KM+2M

∑︂

i=1 (︁G2

iWi,i −E(︁G2

iWi,i)︁)︁

2 Distributed Function Approximation in Fading and Correlated Channels

Σ2:=

2KM+2M

∑︂

i,j=1

i=j

GiGjWi,j.

We use Lemma 1, Lemma 3 and Lemma 4 to bound the moment generating function of

Σ1as

Eexp(λΣ1)≤exp (︄λ2

2(︄8F

˜1

1−c+F

˜2)︄)︄≤exp (︄λ2

2·8F

˜1+F

˜2

1−c)︄

for any c∈(0,1) and λ∈(−c/(2L

˜)), c/(2L

˜)), where

˜:= (︂√︁∆(f||P)∥A∥op +∥B∥op)︂2

˜1:= L

˜(︂√︁∆(f||P)∥A∥F+∥B∥F)︂2

˜2:= (︂4M∆(f||P)∥(A+Ai)(A−Ai)T∥op + 2√︁2P∥ABT∥F)︂2

By Lemma 9, this yields

P(︂|Σ1| ≥ ε

2)︂≤2 exp (︃−(1 −c)ε2

64F

˜1+ 8F

˜2)︃

in case 0 < ε ≤c

1−c·8F

˜1+F

˜2

˜and

P(︂|Σ1| ≥ ε

2)︂≤2 exp (︃−cε

˜)︃

otherwise. Since the first case term is increasing with cand the second case term is

decreasing, the optimal value for cis where the two cases meet, which is at

c=L

˜ε

˜ε+ 8F

˜1+F

˜2

Substituting this, we get

P(︂|Σ1| ≥ ε

2)︂≤2 exp (︃−ε2

64F

˜1+ 8F

˜2+ 8L

˜ε)︃.

Turning our attention to Σ2, we note that by [BCD89, Theorem 2.1] the operator norm of

the off-diagonal part of Wcan be upper bounded by 2∥W∥op and thus by 2L

˜. Therefore,

2.6 Preliminaries on Sub-Gaussian and Sub-Exponential Random Variables

we can directly apply Lemma 2 and get

P(︂|Σ2| ≥ ε

2)︂≤2 exp (︃−ε2

1024F

˜1+ 64L

˜ε)︃.

Substituting these into (2.29) and using (2.24) concludes the proof.

2.6 Preliminaries on Sub-Gaussian and Sub-Exponential Random

Variables

We begin with a definition that is adapted from [Ver18, Definition 3.4.1]. For Rn-valued

random variables V, we define the sub-Gaussian norm as

τ(V) := inf {︂a:∀v∈Sn−1∀λ∈R E exp(λ⟨V, v⟩)≤exp (︃a2λ2

2)︃}︂(2.30)

and we observe that if all entries of Vhave a sub-Gaussian norm bounded by τmax and

are independent, we have for any v∈Sn−1

Eexp (λ⟨V, v⟩) = Eexp (︄λ

∑︂

k=1 Vkvk)︄=

∏︂

k=1

Eexp (λVkvk)

≤

∏︂

k=1

exp (︃τ2

maxv2

kλ2

2)︃= exp (︃τ2

maxλ2

2)︃

and therefore τ(V)≤τmax.

In the following, we recall some basic definitions and results from [BK00, Chapter 1].

For a random variable Xwe define12

θ(X) := sup

k≥1(︃E(|X|k)

k!)︃1

(2.31)

If θ(X)<∞then Xis called a sub-exponential random variable. θ(·) defines a semi-

norm on the vector space of sub-exponential random variables [BK00, Remark 1.3.2].

Typical examples of sub-exponential random variables are bounded random variables and

random variables with exponential distribution. We collect some useful properties of

and interrelations between the sub-exponential and sub-Gaussian norms in the following

lemma.

12Note that as with our definition of the sub-Gaussian norm, other norms on the space of sub-exponential

random variables that appear in the literature are equivalent to θ(·) (see, e.g., [BK00]). The particular

definition we choose here matters, however, because we derive results in which no unspecified constants

appear.

2 Distributed Function Approximation in Fading and Correlated Channels

Lemma 5. Let X, Y be random variables. Then:

1. If Xfollows the normal distribution with expectation µand variance σ2then we have

τ(X) = σ. (2.32)

2. (Rotation Invariance) If X1, . . . , XMare independent, sub-Gaussian and centered,

we have

τ(︄M

∑︂

m=1

Xm)︄2

≤

∑︂

m=1

τ(Xm)2(2.33)

3. If Xis a random variable with |X| ≤ 1with probability 1and if Yis independent of

Xand sub-Gaussian then we have

τ(X·Y)≤τ(Y).(2.34)

4. If Xand Yare sub-Gaussian and centered, then X·Yis sub-exponential and

θ(X·Y)≤2·τ(X)·τ(Y).(2.35)

5. (Centering) If Xis sub-exponential and X≥0almost surely, then

θ(X−E(X)) ≤θ(X).(2.36)

Proof. (2.32) follows in a straightforward fashion by calculating the moment generating

function of X. (2.33) is e.g. proven in [BK00, Lemma 1.1.7]. (2.34) follows directly from

the definition conditioning on X. We show (2.35) first for X=Y. In this case, we have

θ(︁X2)︁= sup

k≥1(︃EX2k

k!)︃1

≤sup

k≥1(︄2k+1kkτ(X)2k

ekk!)︄1

= 2τ(X)2sup

k≥1(︄21

e(k!)1

k)︄≤2τ(X)2,

where the first inequality is by [BK00, Lemma 1.1.4] and the second follows from 2kk/k!≤

ek, which is straightforward to prove for k≥1 by induction. In the general case, we have

θ(XY ) = τ(X)τ(Y)θ(︃XY

τ(X)τ(Y))︃

≤τ(X)τ(Y)θ(︄1

2(︃X

τ(X))︃2

2(︃Y

τ(Y))︃2)︄

≤2τ(X)τ(Y),

2.7 Proof of Lemmas 1 and 2

where the first inequality can be verified in (2.31), considering that ab ≤a2/2 + b2/2 for

all a, b ∈R, and the second inequality follows from the triangle inequality and the special

case X=Y.

For (2.36), we assume without loss of generality EX= 1 (otherwise we can scale X),

and note that for all a∈[0,∞) and k≥1, ak−|a−1|k> a−1 and thus E(Xk−|X−1|k)≥

E(X−1) = 0.

2.7 Proof of Lemmas 1 and 2

The proofs closely follow the proof of the Hanson-Wright inequality in [Ver18, Theorem

6.2.1]. We carry out the changes that are necessary to arrive at explicit constants. To this

end, we begin with some slightly modified versions of lemmas used as ingredients in the

proof of Bernstein’s inequality in [BK00, Theorem 1.5.2].

Lemma 6. Let Xbe a random variable with E(X)=0and θ(X)<+∞. For any λ∈R

with |λθ (X)|<1we have

E(exp(λX)) ≤1 + |λ|2θ(X)2·1

1−|λθ (X)|.

Proof. Let λ∈Rsatisfy |λθ (X)|<1. Then

E(exp(λX)) = 1 + ∞

∑︂

k=2

λkE(Xk)

≤1 + ∞

∑︂

k=2

|λ|kE(|X|k)

k!≤1 + ∞

∑︂

k=2 |λ|kθ(X)k

= 1 + |λ|2θ(X)2(︄∞

∑︂

k=0 |λθ (X)|k)︄

= 1 + |λ|2θ(X)2·1

1−|λθ (X)|,(2.37)

where in the last line we have used |λθ (X)|<1.

In the next lemma we derive an exponential bound depending on θ(X) on the moment

generating function of the random variable X.

Lemma 7. Let Xbe a random variable with E(X) = 0 and θ(X)<+∞. For any

c∈(0,1) and λ∈(︂−c

θ(X),c

θ(X))︂we have

E(exp(λX)) ≤exp(︂λ2

2·θ(X)2

1−c)︂.

2 Distributed Function Approximation in Fading and Correlated Channels

Proof. For λ∈(︂−c

θ(X),c

θ(X))︂we have

|λθ (X)|<c<1,(2.38)

therefore by Lemma 6

E(exp(λX)) ≤1 + |λ|2θ(X)2·1

1−|λθ (X)|

≤1 + |λ|2θ(X)2·1

1−c

≤exp(︂λ2

2·θ(X)2

1−c)︂,

where in the second line we have used the first inequality in (2.38) and the last line is by

the numerical inequality 1 + a≤exp(a) valid for a≥0.

We are now ready to prove Lemma 1.

Proof of Lemma 1. The lemma follows by a straightforward calculation

Eexp (︄λ

∑︂

k=1 (︁AkV2

k−E(AkV2

k))︁)︄

=E(︄exp (︄λ

∑︂

k=1 (︁Ak(V2

k−E(V2

k)))︁)︄·exp (︄λ

∑︂

k=1 (︁E(V2

k)(Ak−E(Ak))︁)︄)︄ (2.39)

=EA(︄exp (︄λ

∑︂

k=1 (︁E(V2

k)(Ak−E(Ak))︁)︄·

∏︂

k=1

EVexp (︁(λAk)(︁V2

k−E(V2

k))︁)︁)︄(2.40)

≤EA(︄exp (︄λ

∑︂

k=1 (︁E(V2

k)(Ak−E(Ak))︁)︄·

∏︂

k=1

exp (︃λ2

2·8A2

1−c)︃)︄(2.41)

≤exp (︄λ2

2·8F

1−c)︄E(︄exp (︄λ

∑︂

k=1

(Ak−EAk))︄)︄,(2.42)

where (2.40) follows by the independence assumptions, (2.41) is an application of Lemma 7

and (2.42) holds because ∑︁n

k=1 A2

k≤F

˜almost surely. For this last step, also note that the

variance of a sub-Gaussian random variable is upper bounded by the squared sub-Gaussian

norm [BK00, Lemma 1.1.2].

Lemma 8. Let X1, . . . , Xnbe independent random variables with E(Xi)=0and θ(Xi)<

+∞,i= 1, . . . , n. Let L:= max1≤i≤nθ(Xi),c∈(0,1), and λ∈(︁−c

L,c

L)︁. Then for

2.7 Proof of Lemmas 1 and 2

ΣM:= ∑︁n

i=1 Xiwe have

E(exp(λΣM)) ≤exp(︂λ2

2·∑︁n

i=1 θ(Xi)2

1−c)︂.(2.43)

Proof. By independence of X1, . . . , Xn, we have

E(exp(λΣn)) =

∏︂

i=1

E(exp(λXi)).

Combining this with Lemma 7 proves the lemma.

The next lemma establishes the basic tail bound for random variables satisfying inequal-

ities of type (2.43). The proof can be found in [BK00, Lemma 1.4.1].

Lemma 9. Let Xbe a random variable with E(X) = 0. If there exist τ≥0and Λ>0

such that

E(exp(λX)) ≤exp(︂λ2

2τ2)︂,

holds for all λ∈(−Λ,Λ), then for any t≥0, we have

P(|X| ≥ t)≤⎧

⎨

⎩

2 exp (︂−t2

2τ2)︂,0< t ≤Λτ2

2 exp (︁−Λt

2)︁,Λτ2≤t.

The following lemma is a slightly modified version of [Ver18, Lemma 6.2.3]. N(µ, Σ)

denotes the multivariate normal distribution with mean µand covariance matrix Σ.

Lemma 10 (Comparison Lemma).Let Vand V′be independent, Rn-valued, centered and

sub-Gaussian random variables, and let G, G′be independent and distributed according to

N(0,idn). Let A ∈ Rn×nand λ∈R. Then

Eexp(λVTAV′)≤Eexp(λτ (V)τ(︁V′)︁GTAG′.)

Proof. We first observe that for any v∈Rn,

E(exp(λ⟨V, v⟩)) = E(︃exp (︃λ∥v∥2⟨︃V,v

∥v∥2⟩︃)︃)︃

≤exp (︄λ2∥v∥2

2τ(V)2

2)︄

=E(︂exp(︁λτ (V)⟨G, v⟩)︁)︂,(2.44)

2 Distributed Function Approximation in Fading and Correlated Channels

where the inequality in (2.44) is by the definition of vector-valued sub-Gaussian random

variables and the equality is obtained by calculating the moment-generation function of

⟨G, v⟩. We can now conclude the proof from the following:

E(︂exp (︁λVTAV′)︁)︂=EV′(︃EV(︂exp (︁λ⟨︁V,AV′⟩︁)︁)︂)︃(2.45)

≤EV′(︃EG(︂exp (︁λτ (V)⟨︁G, AV′⟩︁)︁)︂)︃(2.46)

=EG(︃EV′(︂exp (︁λτ (V)⟨︁V′,ATG⟩︁)︁)︂)︃(2.47)

≤EG(︃EG′(︂exp (︁λτ (V)τ(︁V′)︁⟨︁G′,ATG⟩︁)︁)︂)︃(2.48)

=E(︂exp (︁λτ (V)τ(︁V′)︁GTAG′)︁)︂,(2.49)

where (2.45), (2.47) and (2.49) are due to Fubini’s theorem and elementary transformations

and (2.46) and (2.48) are both instances of the observation (2.44).

Proof of Lemma 2. We can write

VTAV =

∑︂

k,k′=1,k=k′VkAk,k′Vk′,(2.50)

and since Vis centered, E(︁VTAV)︁= 0 immediately follows. Let V′be an independent

copy of V, and let Gand G′be independently distributed according to N(0,idn). We

denote the singular values of Awith s1, . . . , sn. With these definitions, we bound the

moment-generating function of VTAV as

Eexp (︁λVTAV)︁=Eexp (︁λVTAV)︁(2.51)

≤Eexp (︁4λVTAV′)︁(2.52)

≤Eexp (︂4λτ (V)2GTAG′)︂(2.53)

=Eexp (︄4λτ (V)2

∑︂

k=1

ˆkG

ˆ′

ksk)︄(2.54)

≤exp (︄λ2

2·128τ(V)4∑︁n

k=1 s2

1−c)︄,(2.55)

where (2.52) is due to the Decoupling Theorem [Ver18, Theorem 6.1.1], (2.53) is an ap-

plication of Lemma 10, (2.54) holds for suitably transformed versions G

ˆ, G

ˆ′of G, G′(note

that they are still independent and follow the same distribution) and (2.55) is true if

2.7 Proof of Lemmas 1 and 2

c∈(0,1) and |λ|< c/(8τ(V)2max1≤k≤nsk) according to Lemma 8. So we can apply

Lemma 9 to obtain

P(︁VTAV≥ε)︁≤2 exp (︄−ε2(1 −c)

256τ(V)4∑︁n

k=1 s2

k)︄(2.56)

in case ε≤c

1−c·16τ(V)2∑︁n

k=1 s2

max1≤k≤nskand

P(︁VTAV≥ε)︁≤2 exp (︄−c·ε

16τ(V)2max1≤k≤nsk)︄

otherwise. We next choose cso as to minimize the upper bound on the tail probability.

Because the bound in the first case is increasing with cwhile it is decreasing in the second

case, the optimal choice for cis where the two cases meet. We can therefore calculate the

optimal cas

c=εmax1≤k≤nsk

εmax1≤k≤nsk+ 16τ(V)2∑︁n

k=1 s2

and substituting this in (2.56), we obtain

P(︁VTAV≥ε)︁≤2 exp ⎛

⎝−ε2

16ετ (V)2max

1≤k≤nsk+ 256τ(V)4∑︁n

k=1 s2

k⎞

⎠.

The bounds τ(V)≤τmax,|sk| ≤ ∥A∥op, and identity ∥A∥2

F=∑︁n

k=1 s2allow us to conclude

the proof of the lemma.

3 Applications of Distributed Function

Approximation to Vertical Federated

Learning

In this chapter, we give examples of how the results of Chapter 2 can be applied to ML

problems. We focus on a case that is particularly simple on the communication side (since

only one weighted sum is OTA computed) but which we expect to gain significant relevance

in practical systems. The application examples show how our DFA scheme can be leveraged

to vastly increase the efficiency of the prediction phase of VFL both in terms of time and

bandwidth resources (i.e., in our model, channel uses) and in terms of energy resources

expended. For the training phase, we will either assume centralized offline training or use

more communication-efficient decentralized methods that do not, however, leverage any

form of OTA computation. Developing distributed training algorithms for VFL which

can leverage the full power of OTA computation remains open as an interesting future

research direction. However, we argue that in many cases of interest the communication

cost incurred in the prediction phase can dominate that incurred in training and it is

therefore worthwhile to focus on the prediction. This can, e.g., be the case when the

training can be conducted offline and the models do not or only infrequently have to be

re-trained; or when the number of training samples is small, but the number of features

observed in the system is large and prediction tasks have to be carried out very frequently.

Contrary to [YLCT19], where the main focus in VFL is on providing privacy and se-

curity guarantees, in this work we focus on the communication efficiency of such schemes

under the use of OTA computation. Since the OTA-computed predictors as well as the

distributed training procedures we describe do not aggregate the observed features in a

central point, it is reasonable to expect that these methods have good inherent privacy

properties, and for some of the envisioned applications, such as, e.g., e-health, it is an im-

portant open question for future research how these privacy guarantees can be formalized

and perhaps strengthened in the context of OTA-computed ML predictors. In Chapter 4,

we address some of these concerns by providing formal security guarantees in the presence

of eavesdroppers.

3 Applications of Distributed Function Approximation to Vertical Federated Learning

In the following, we expand upon the idea of OTA VFL by showing in Section 3.1 how

the SVM approach can be generalized to the use of additive kernel SVMs and applied

to regression and classification problems, and in Section 3.2 how classifiers that do not

necessarily have to rely on SVMs at all can be constructed to solve binary classification

problems. In Section 3.3, we present simulation results of two classification schemes con-

structed as described in Section 3.2 and compare them to two baseline approaches.

Both in Section 3.1 and in Section 3.2, we construct an ML regressor or predictor that

has the form of a weighted sum, because such a function can be straightforwardly computed

OTA using the DFA scheme described in Section 2.5. If the loss is Lipschitz-continuous,

it can play the role of the function Fin Definition 10 so that Theorem 2 provides a tail

bound on the additional ML loss that the OTA-computed classifiers can incur in addition

to the loss they would have in case of noiseless communication. The detailed technical

statements and proofs of these facts can be found in Corollaries 5 and 6. We also give

some examples of applicable Lipschitz-continuous losses for regression and classification

that are commonly used in practice in Subsection 3.1.

3.1 Support Vector Machines with Additive Kernels for

Regression and Classification

In this section, we give an example of additive, and therefore OTA computable, SVM

regressors, focusing on Lipschitz-continuous losses. We say that the loss Lis B-Lipschitz-

continuous if L(x, y, ·) is Lipschitz-continuous for all x∈Xand y∈Ywith a Lipschitz

constant uniformly bounded by B. Lipschitz-continuity of a loss function is a property

that is also often needed in other contexts. Fortunately, many loss functions of practical

interest possess this property. For instance, the absolute distance loss, the logistic loss,

the Huber loss and the ε-insensitive loss, all of which are commonly used in regression

problems [SC08, Section 2.4], are Lipschitz-continuous. Even in scenarios in which the

naturally arising loss is not Lipschitz-continuous, for the purpose of designing the ML

model, it is often replaced with a Lipschitz-continuous alternative. For instance, in binary

classification, we have Y={−1,1}and the loss function is given by

(x, y, t)↦→ ⎧

⎨

⎩

0,sign(y) = sign(t)

1,otherwise.

This loss is not even continuous, which makes it hard to deal with. So for the purpose

of designing the ML model, it is commonly replaced with the Lipschitz-continuous hinge

loss or logistic loss [SC08, Section 2.3].

3.1 Support Vector Machines with Additive Kernels for Regression and Classification

Here, we consider the case in which the features are K-tuples and the SVM can be

trained in a centralized fashion. The actual predictions, however, are performed in a

distributed setting; i.e., there are Kusers each of which observes only one component of

the features. The objective is to make an estimate of the label available at the receiver

while using as little communication resources as possible.

To this end, we consider the case of additive models which is described in [CH12,

Section 3.1]. We have X=X1× ··· × XKand a kernel κk:Xk×Xk→Rwith an

associated reproducing kernel Hilbert space Hkof functions mapping from Xkto Rfor

each k∈ {1, . . . , K}. Then by [CH12, Theorem 2]

κ:X×X→R,((x1, . . . , xK),(x′

1, . . . , x′

K)) ↦→ κ1(x1, x′

1) + ···+κK(xK, x′

K) (3.1)

is a kernel and the associated reproducing kernel Hilbert space is

H:= {f1+···+fK:f1∈H1, . . . , fK∈HK}.(3.2)

So this model is appropriate whenever the function to be approximated is expected to

have an additive structure. We know [SC08, Theorem 5.5] that an SVM estimator has the

form

f(x) =

∑︂

j=1

αjκ(x, xj),(3.3)

where α1, . . . , αn∈Rand x1,...xn∈X. In our additive model, this is

f(x1, . . . , xk) =

∑︂

k=1

fk(xk),(3.4)

where for each k,

fk(xk) =

∑︂

j=1

αjκk(xk, xj

k).(3.5)

We now state a result for the distributed approximation of the estimator of such an

additive model as an immediate consequence of Theorem 2.

Corollary 5. Consider an additive ML model, i.e., we have an estimator of the form

(3.4), and assume that Lis a B-Lipschitz-continuous loss. Suppose further that all the

fKhave bounded range such that the quantities ∆

¯(f)and ∆(f)as defined in (2.9) and

(2.10) exist and are finite. Let ε, δ > 0and M≥M(f, ε, δ)as defined in (2.14), where

3 Applications of Distributed Function Approximation to Vertical Federated Learning

Φ := id and thus Φ−1(ε) = ε. Then, given xK= (x1, . . . , xK)drawn from an arbitrary

distribution13,14 at the transmitters and any y∈Y, through Muses of the channel (2.2),

the receiver can obtain an estimate f

¯of f(xK)satisfying

P(L(xK, y, f

¯)−L(xK, y, f(xK))≥Bε)≤δ. (3.6)

Proof. The Lipschitz continuity of Lyields

P(L(xK, y, f

¯)−L(xK, y, f(xK))≥Bε)≤P(f

¯−f(xK)≥ε),

from which (3.6) follows by the definition of M(f, ε, δ).

While [CH12] provides some examples of applications of SVMs with additive kernels to

regression problems, the example for anomaly detection described in [RGS16] can be re-

covered as a special case of the framework described in this subsection, where the employed

SVMs have linear kernels.

We conclude this subsection with a brief discussion of the feasibility of the condition

that f1, . . . , fKhave bounded ranges in the case of the additive SVM model discussed

above. The coefficients α1, . . . , αnare a result of the training step and can therefore be

considered constant, so all we need is that the ranges of κ1, . . . , κKare bounded. This

heavily depends on X1,...,XKand the choices of the kernels, but we remark that the

boundedness criterion is satisfied in many cases of interest. The range of Gaussian kernels

is always a subset of (0,1], and while other frequent choices such as exponential, polynomial

and linear kernels can have arbitrarily large ranges, they are nonetheless continuous which

means that as long as the input alphabets are compact topological spaces (e.g., closed

hyperrectangles or balls), the ranges are also compact, and therefore bounded.

3.2 Model-Agnostic Approach to Over-the-Air-Computed

Classifiers

In this subsection, we focus on classification problems. The general approach is model-

agnostic in the sense that arbitrary and even different ML models can be used in the

distributed agents, but we have decentralized classifiers with a low computational bur-

den in mind, as is exemplified in the numerical simulations discussed in the following

subsection.

13Arbitrary distribution means in particular that the components can be arbitrarily correlated.

14Note that Theorem 2 actually provides for a stronger result, since it allows arbitrary deterministic

values, which implies the applicability to arbitrarily distributed random variables through the law of

total probability.

3.2 Model-Agnostic Approach to Over-the-Air-Computed Classifiers

We consider a feature alphabet X=X1×···×XKand a label alphabet Y={−1,1}as

well as an unknown, but fixed probability distribution Pon X×Y. In the training phase,

each user kobserves Jtraining samples

Tk=(︂(︂x(1)

k, y(1))︂,...,(︂x(J)

k, y(J))︂)︂,

where for all k, j, we have x(j)

k∈Xk,y(j)∈Yand (x(j)

1, . . . , x(j)

K, y(j)) is drawn according

to P.

Each user kcan train its own model based on Tkwhich is distributed according to

the marginal of Pwith respect to Xk×Y. We propose to use a slight variation of the

well-known boosting technique and define a classifier

f:=

∑︂

k=1

αkfk,(3.7)

where fkis the base classifier locally trained at user kand αkis a nonnegative weight. As

an immediate corollary to Theorem 2 parallel to Corollary 5, fcan be approximated at a

central node in a distributed manner.

Corollary 6. Assume that Lis a B-Lipschitz-continuous loss. Let ε, δ > 0and M≥

M(f, ε, δ)as defined in (2.14), where Φ−1(ε) = ε, noting that

∆

¯(f)=2

∑︂

k=1

αk,∆(f)=2 K

max

k=1 αk.

Then, given any xK= (x1, . . . , xK)drawn from an arbitrary13,14 distribution at the trans-

mitters and any y∈Y, through Muses of the channel (2.2), the receiver can obtain an

estimate f

¯of f(xK)satisfying

P(L(xK, y, f

¯)−L(xK, y, f(xK))≥Bε)≤δ. (3.8)

The proof is the same as for Corollary 5.

It is important to remark here that the predictor fcan only be approximated at the

receiver up to a residual error (which can, however, be controlled) and thus, a guarantee

in terms of the 0-1-loss is not sufficient to apply Corollary 6 and we instead need it to be

in terms of a Lipschitz-continuous loss.

This is a relatively generic framework that can in principle work with any particular

boosting technique which determines weights α1, . . . , αKand guarantees a bound on the

loss of the predictor fdependent on the errors of the base classifiers f1, . . . , fK. In the

3 Applications of Distributed Function Approximation to Vertical Federated Learning

following, we describe two variations of this general approach, both based on well-known

ideas from ML (cf., e.g., [MRT12, Chapter 6]).

The first one, equal majority vote, amounts to setting α1=··· =αK= 1 and using

local classifiers fktrained only on the locally available features. This method has the

advantage that the whole training procedure can be carried out in a fully decentralized

way without any form of coordination or exchange of information between the agents

(given that the labels for the training phase are already known everywhere; but they

could, e.g., be broadcast from a central point at a cost independent of the number of

agents or dimensionality of the feature space).

If we have the possibility to exchange some data between the agents, we can use the

following adaptation of the AdaBoost scheme [MRT12, Figure 6.1] to the distributed

setting. The algorithm runs through L≤Kiterations, choosing a user hℓat iteration ℓto

provide a base classifier fhℓand assigning a corresponding weight αhℓ. It also computes

probability distributions p1, . . . , pL+1 on the index set of the training data {1, . . . , J},

initializing p1as the uniform distribution, as well as base classifier errors ϵ1, . . . , ϵLand

normalization constants Z1,...,ZL. Each iteration ℓconsists of the following steps:

1. The central node chooses a user hℓand broadcasts the choice.

2. User hℓtrains a base classifier fhℓ:Xhℓ→ {−1,1}on the training sample with dis-

tribution pℓand broadcasts the indices of the training samples incorrectly classified

by fhℓ.

3. From this information, every node in the system computes the following, where 1·

denotes the indicator function which is 1 if the condition in the index is true and 0

otherwise:

•ϵℓ:= ∑︁J

j=1 pℓ(j)1fhℓ(x(j)

hℓ)=y(j)

•αhℓ:= 1

2log 1−ϵℓ

ϵℓ

•Zℓ:= 2√︁ϵℓ(1 −ϵℓ))

•pℓ+1(j) := pℓ(j) exp(−αhℓfhℓ(x(j)

hℓ)y(j))/Zℓ

The resulting classifier is then as defined in (3.7), where we assign αk:= 0 whenever k=hℓ

for all ℓ. [MRT12, Theorem 6.1] guarantees that the empirical 0-1-loss of fis at most

exp (︄−2

∑︂

ℓ=1 (︃1

2−ϵℓ)︃2)︄,(3.9)

which unfortunately is insufficient to apply Corollary 6, because the 0-1-loss is not Lipschitz-

continuous. However, the proof of the theorem relies only on the inequality 1f(xK)y≤0≤

3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers

exp(−f(xK)y) for the instantaneous 0-1-loss. Since the inequality log(1+exp(−f(xK)y)) ≤

exp(−f(xK)y) also clearly holds, we can replace the 0-1-loss in the proof with the logis-

tic loss L(xK, y, yˆ) := log(1 + exp(−yyˆ)) (or, indeed, any other loss which satisfies this

inequality). This yields the same bound (3.9) on the 1-Lipschitz-continuous logistic loss

and thus we can apply Corollary 6 with B:= 1 to derive a guarantee on the logistic loss

of the distributed approximation of our AdaBoost classifier.

We conclude with some remarks on the distributed training. The choice in step 1 could,

e.g., be predetermined (in which case no communication in this step is necessary) or ran-

dom, but we could also greedily select the classifier with smallest error using an instance of

ScalableMax [9] [2, Section IV]. As for the communication cost of the distributed training,

step 1 exhibits a favorable scaling which is linear in Land logarithmic in K, however, step

2 has a cost linear in the number of training samples. There is a conceptually simpler

alternative to this distributed scheme in which we communicate the full training set to

the central node and perform the training in a centralized manner. The advantage in

communication cost of the distributed scheme over this centralized alternative is only a

constant factor. On the other hand, since only one bit per training sample and user is

transmitted, this constant gain could potentially be quite large, depending on the com-

plexity of the feature spaces. Also, in the distributed training scheme, the computational

load of training the base classifiers is distributed across all nodes, which may in practice

also be an advantage wherever the computational capabilities of the central node are lim-

ited. However, since the distributed training currently leverages no OTA computation and

leaves that for the computation of the trained classifier itself, finding a distributed scheme

which can exploit OTA computation to achieve a gain in asymptotic behavior as opposed

to only a constant factor could be a worthwhile question for future research.

3.3 Numerical Results for Over-the-Air-Computed Decision Tree

Classifiers

In order to illustrate how the scheme analyzed in this work can be used to compute

classifiers for anomaly detection problems in large sensor networks, we have conducted

numerical simulations on a synthetic binary classification problem generated by the

make classification function in the datasets package of the scikit-learn toolbox [P+11]

for Python. It places clusters for the two classes at the edges of a hypercube in a Euclidian

space of informative features, adds redundant features that are linear combinations as well

as useless features that are pure noise and applies various kinds of noise and nonlinear-

ities. The resulting features are then shuffled randomly, partitioned and assigned to the

distributed agents. For the training set, the agents also learn the correct corresponding

3 Applications of Distributed Function Approximation to Vertical Federated Learning

labels. We construct two different OTA-computed classifiers as described in the preceding

subsection:

1. For the equal majority vote classifier, each agent trains a decision tree model of

height at most 2 based on the locally available features only. The OTA-computed

classifier is then as put forth in equation (3.7), where α1=··· =αK= 1.

2. For the AdaBoost classifier, the agents train their models cooperatively as described

in the preceding section, using decision tree classifiers as the local base classifiers.

The next agent at each iteration is picked uniformly at random from among the

agents which have not yet been selected. This procedure yields not only differ-

ently trained local models compared to the equal majority vote, but also weights

α1, . . . , αKwhich can be used for the OTA-computed classifier as in equation (3.7).

We assume that the agents are connected to a central receiver through a fast-fading wireless

MAC, where no instantaneous CSI is available. The only kind of information we assume

is available is the average power of the complex Gaussian channel gain at the transmitters

and the average power of the additive noise at the receiver. The distributed classification is

simulated for noise and fading drawn from i.i.d. Gaussian distributions and for a scenario

exhibiting various degrees of correlation and non-Gaussian components:

•For the fading, we achieve this by passing the fading coefficients through a lowpass

filter, which cuts off all but a given percentage of the energy (the cutoff percentage)

and then re-normalizes the remaining signal.

•For the noise, we simulate Middleton class A noise (also called impulsive noise), which

is a commonly used noise model for power line communications [FLNS10, Section

2.6.3.1] but has also been empirically found to be a relevant phenomenon in wireless

communications [MS93]. We simulate it as described in [FLNS10, eq. (2.49) ff.]: In

order to create one sample of noise, a random variable mis drawn from a Poisson

distribution with intensity A, a parameter called the impulsive index, and then a

centered Gaussian random variable with variance

2PN

m/A+Γ

1 + Γ

is drawn, where PNis the overall power of the noise per complex dimension and

the parameter Γis called the Gaussian-to-impulsive power ratio. Finally, a phase

shift is applied drawn uniformly from the complex unit circle. This process defines

non-Gaussian, but i.i.d. noise. We have therefore modified it slightly and draw one

3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers

mfor every 4 channel realizations so that we create a correlation also in the additive

noise.

We simulate the computation of each of the two distributed classifiers described above in

three different ways:

1. The DFA scheme as described in Section 2.5.

2. A TDMA scheme with average power normalization in which only one of the agents

can transmit at a time. The information is also transmitted in analog form, since

each agent conveys only one bit of information and therefore digital coded schemes

would not be suitable. Since the agents transmit during a much shorter time than in

the DFA scheme, we normalize their transmission power so that the average energy

consumed per channel use equals that in the DFA scheme. The only exception to

this is the case when some agents have to be allocated zero channel uses, since the

number of total channel uses available is smaller than the number of agents in the

system. In this case, obviously, the agents allocated zero channel uses also have

zero energy consumption. This scheme has a significantly higher peak transmission

power than the DFA scheme.

3. A TDMA scheme with peak power normalization. It works as the one under item 2,

but the transmission power is normalized so that it has the same peak power as the

DFA scheme, which means that it has a significantly lower average consumption.

We also show two baselines to make the error contribution of the compared communication

schemes clearer:

1. a noiseless version of the majority vote classifier, and

2. a noiseless version of the AdaBoost-inspired classifier.

The training set consists of 50,000 samples and the test set of 200,000. We have

generated two different binary classification problems, one for 10 transmitters and one for

25 transmitters.

Comparison of DFA and TDMA schemes for equal majority vote and AdaBoost We

have simulated the DFA scheme as well as the TDMA baseline comparisons for a scenario

with moderate correlation and non-Gaussianity. The cutoff percentage for the lowpass

filter applied to the fading was chosen at 80%, and the parameters for the Middleton

Class A noise were A= 3 and Γ= 3. In Fig. 3.1, we plot the classification error on

the test set as a function of the number of complex channel uses for a fixed SNR of −6

3 Applications of Distributed Function Approximation to Vertical Federated Learning

50 100 150 200 250

0.05

0.1

0.15

0.2

0.25

0.3

DFA gain

number of complex channel uses

test error

equal majority DFA

AdaBoost DFA

equal majority TDMA average-normalized

AdaBoost TDMA average-normalized

equal majority TDMA peak-normalized

AdaBoost TDMA peak-normalized

AdaBoost noiseless

equal majority noiseless

Figure 3.1: Comparison of the classification error on the test set of DFA/TDMA and equal

majority/AdaBoost schemes. 10 transmitters, cutoff percentage 80%, A= 3,

Γ= 3, SNR −6 dB.

3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers

−20 −10 0 10 20

0.04

0.06

0.08

0.1

0.12

0.14

DFA error

floor gain

DFA error floor gain

SNR

test error

Figure 3.2: Comparison of the classification error on the test set of DFA/TDMA and equal

majority/AdaBoost schemes. 10 transmitters, cutoff percentage 80%, A= 3,

Γ= 3, 50 complex channel uses.

50 100 150 200 250

0.1

0.2

0.3

0.4

DFA gain DFA gain

number of complex channel uses

test error

Figure 3.3: Simulation results for 25 transmitters at a fixed SNR of −6 dB, correlation

parameters are the same as in Fig. 3.1.

3 Applications of Distributed Function Approximation to Vertical Federated Learning

dB and 10 transmitters, and in Fig. 3.2, we plot the error as function of SNR for a fixed

number of 50 complex channel uses. We can see that, as the effect of the multiplicative

fading dominates that of the additive noise, the schemes reach an error floor that cannot be

lowered with an increase of the transmission power. When the number of complex channel

uses is increased, on the other hand, the error curves approach the noiseless classification

error even if the SNR is kept fixed.

For instance, to obtain a classification error of 0.07 or better, both for the equal majority

vote and AdaBoost, the average-power normalized TDMA scheme needs over 30 channel

uses more than the DFA scheme. Since we compare with a TDMA scheme that uses

the same average energy per channel use as the DFA scheme, this means that the TDMA

scheme not only consumes more wireless spectrum and/or time, but also significantly more

energy. For the case of the same peak power consumption (which means that TDMA

consumes less average power since transmitters are silent most of the time), the difference

is huge and can be several hundred channel uses depending on the error level.

The advantage of the DFA scheme over the TDMA alternatives is quite pronounced

even at a relatively low number of only 10 transmitters. In Fig. 3.3, we show the same

plot as in Fig. 3.1, but for a different ML problem with 25 transmitters. It can be seen

in the plot that as the number of transmitters increases, the difference in performance

between the DFA and TDMA schemes becomes even stronger. This is due to the different

scaling behaviors of the schemes.

We have run these simulations for many instantiations of the randomly generated clas-

sification schemes (not depicted for lack of space) and note that while in some cases the

equal majority vote scheme performs similarly as the AdaBoost scheme in the noiseless

case, in many cases the error behavior of the AdaBoost scheme is much better and it is

more robust, e.g. in the case that a large number of agents observes only useless features

while only few agents observe the informative and repetitive features that can be used

to solve the classification problem. That being said, in the case in which the equal ma-

jority vote performs similarly to AdaBoost in the noiseless case, its error behavior in the

communication schemes is better since it better utilizes the available peak transmission

power.

Synchronization errors Since the pre-processing described in Section 2.5.1 creates a se-

quence of i.i.d. random variables (conditioned under s1, . . . , sK), we can expect that the

scheme is quite robust to synchronization errors between the transmitters. This is im-

portant since perfect synchronization of the transmitted signals at the receiver would be

a very hard task to achieve in practice. In order to substantiate this argument, we have

run simulations with relatively large synchronization errors of several symbol durations:

3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers

50 100 150 200 250

0.05

0.1

0.15

0.2

0.25

0.3

number of complex channel uses

test error

perfect synchronization

synchronization errors of 1, 5, 10, 15,

20 and 25 channel uses

noiseless

Figure 3.4: Impact of synchronization errors on the DFA AdaBoost scheme in the scenario

of Fig. 3.1.

3 Applications of Distributed Function Approximation to Vertical Federated Learning

For the starting time of the transmission for each transmitter, we have added a uniformly

random number of channel uses in a range of up to 1, up to 5, up to 10, up to 15, up

to 20 and up to 25 channel uses. In Fig. 3.4, we show the impact of the synchronization

errors for the AdaBoost DFA scheme and the same choice of parameters as for Fig. 3.1.

The solid red curves (representing the case of perfect synchronization) are therefore the

same in both figures while the blue curves in Fig. 3.4 depict the performance for various

values of the maximum synchronization error. The performance degrades gracefully even

for extremely large synchronization errors and the number of additional channel uses that

needs to be expended to maintain the same classification error is about twice the value of

the maximum synchronization error. We remark that we expect the number of additional

channel uses that needs to be expended to scale with the synchronization error and not

with the number of transmitters. Moreover, for synchronization errors of the same order

of magnitude as the symbol duration, the drop in performance is barely noticeable.

Comparison of different correlation scenarios In order to get an idea of how strongly

the correlation and non-Gaussianity impact on the performance of the scheme, we have

compared the AdaBoost DFA scheme (the solid red curve from Fig. 3.1) for various choices

of the correlation and non-Gaussianity parameters. Qualitatively, the higher the values

of Aand Γ, the more closely the additive noise is to a Gaussian distribution and the

lower the values, the stronger pronounced is the non-Gaussianity of the noise. For the

fading, a cutoff percentage of 100% corresponds to i.i.d. Gaussian fading, while lower

cutoff percentages mean that the fading changes more slowly over time. The results of

our comparison in Fig. 3.5 show that the scheme performs best for the Gaussian i.i.d.

case but, as is expected from the theoretical analysis, the exponential decay of the error

is retained even for strongly pronounced correlation and non-Gaussianity.

3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers

50 100 150 200 250

0.05

0.1

0.15

0.2

0.25

0.3

number of complex channel uses

test error

i.i.d. Gaussian fading and noise

cutoff percentage 80%, A= 3, Γ= 3

(same as Fig. 3.1)

cutoff percentage 80%, A= 0.5, Γ= 1

cutoff percentage 50%, A= 0.25, Γ= 1

cutoff percentage 30%, A= 1, Γ= 0.5

noiseless

Figure 3.5: Comparison of the performance of AdaBoost DFA for various choices of the

non-Gaussianity and correlation parameters.

4 Security in Over-the-Air Computation

OTA computation schemes carry the promise that they can improve communication ef-

ficiency so dramatically in many cases of practical interest that they can be seen as an

enabler for applications in massive wireless networks for which the communication cost

or the time delay incurred would otherwise be prohibitive. However, there is also a flip

side that has the potential to hinder widespread adoption: Some tools that enhance the

properties of communication and are frequently used as building blocks in communication

systems inherently rely on the principle of source-channel separation. Therefore, they can-

not be adapted to work in a scenario where a joint source-channel approach is taken such

as in OTA computation. One example of such a building block that is particularly impor-

tant in modern communication systems is cryptography. OTA communication schemes

as described in Chapter 2 are vulnerable to a number of attacks such as malicious trans-

mitters participating in the scheme or attackers eavesdropping on the transmission, and

it is unclear whether and how state-of-the-art cryptographic security could be adapted to

defend against such threats.

At least for the latter kind of threat – attackers eavesdropping on the communication –

information-theoretic security, while not adaptable in a straightforward fashion, provides

a set of tools with which a defense can be developed. The ultimate goal in this direction

should be full semantic security [BTV12]. As a first step, we propose to extend the system

model with a jammer as depicted in Fig. 4.1. This shows how information-theoretic

security tools can be adapted to the OTA computation setting. The jammer can increase

the variance of the eavesdropper’s estimate of the quantity of interest, but not fully prevent

it from obtaining an estimate.

The key assumption we make is that the received jamming signal must be stronger for

the legitimate receiver than it is for the eavesdropper. This way, the legitimate receiver

Bcan exploit the dependencies which we carefully introduce into the jamming signal to

reconstruct it exactly. To the eavesdropper, the received signal is almost equivalent to an

i.i.d. jammed transmission. With the knowledge of the full jamming signal, the legitimate

receiver can then cancel it from its received signal or at least mitigate its impact. The

approximation of the OTA computed function value at the legitimate receiver can then be

carried out independently of the security scheme. It can, e.g., follow the method described

4 Security in Over-the-Air Computation

M-fold

channel

DMB

YMf

Figure 4.1: System model for distributed function approximation with security constraints.

in Chapter 2. In the present chapter, we give an example for how to combine the security

and function approximation schemes in the simpler case of an AWGN channel without

fast fading.

While our results on OTA computation rely heavily on the particular structure of wire-

less channels, the class of channels for which the reconstruction of the jamming signal is

possible is much more general. This will become apparent in the following.

We are not aware of a similar system model having been proposed before for OTA

computation, but we draw heavily from existing tools in information theory. As the main

building blocks for proving security guarantees in a DFA scheme with friendly jamming, we

use two information theoretic tools, namely coding for the compound channel and channel

resolvability. The latter ensures that the jamming signal can be reconstructed fully by

the legitimate receiver15 and thus be canceled from the received signal. Therefore, it

has no impact on the quality of the objective function estimate. Channel resolvability

guarantees that the jamming signal is virtually indistinguishable from white noise for the

eavesdropper. This virtual indistinguishability is phrased in terms of variational distance

between the actually observed distribution and a superposition of the transmitters’ signal

with white noise from the jammer. The coding result for the continuous compound channel

which we derive may also be of interest as an independent result.

4.1 Prior Work

To the best of our knowledge, the OTA computation problem over a wiretap channel

has not yet been considered in the literature. Therefore, in this subsection we briefly

summarize the literature on the building blocks other than OTA computation we use for

15For a formal statement, see Definition 12.

4.1 Prior Work

the approach to the wiretap OTA computation channel that we propose in this work as

well as for literature on concepts that are closely related to the ones presented in this

paper.

Coding for compound channels. The compound channel problem was introduced in-

dependently in [Dob59, BBT59, Wol59], while first independent results for the capacity

expression can be found in [BBT59, Wol59]. These works, however, explore mainly the

case of finite input and output alphabets. The semicontinuous case in which only the

input alphabet is assumed to be finite is briefly touched upon in [Wol59] and studied in

more detail in [Kes61] which provides an example showing that the capacity expression

from the finite case does not carry over to the semicontinuous case in general. The semi-

continuous case was further explored in [Yos65,Ahl67]. In many cases of practical interest,

the capacity expression from the finite case can be generalized to the continuous case in

which neither input nor output alphabets are assumed to be finite, as was found in [RV68]

for a class of Gaussian compound channels.

Channel Resolvability and Semantic Security. The concept of channel resolvability was

introduced in [Wyn75a, HV93]. Further results relevant in the context of this work ap-

peared, e.g., in [Csi96,Dev05,HM16,Cuf16]. We use our generalization proposed in The-

orem 6 for continuous channels as a basis for our proposed scheme. Although we cannot

provide full semantic security guarantees in this work, we also heavily draw from the

idea of obtaining semantic security by means of channel resolvability, which is laid out

in [Hay06,CK11,BTV12,BL13].

Friendly Jamming The idea of friendly jamming has been used in [NG05] to aid a

transmitter-receiver pair in protecting a point-to-point transmission from a passive eaves-

dropper. Distributed and centralized beamforming techniques are used so that the jam-

ming signal impacts the signal-to-noise ratio at the eavesdropper but not at the legitimate

receiver. Several more recent works (cf., e.g., [VBBM10,VBBM11,SY12]) have expanded

upon this idea and refined the friendly jamming techniques, but to the best of our knowl-

edge, they have not yet been used to protect OTA computation against eavesdropping.

Physical Layer Security The concept of information theoretic secrecy was introduced

in [Sha49] and the wiretap channel model together with a weaker, but more tractable

notion of secrecy was introduced in [Wyn75b]. Based on this, various stronger secrecy

notions have been introduced and investigated (e.g., [Mau94,HK14,BTV12]). All of these

existing works investigate how digitally coded transmissions can be protected against

4 Security in Over-the-Air Computation

M-fold

channel

DMB

YMf

WBM

WEM

Figure 4.2: System model for distributed function approximation with jamming described

in Section 4.2.2.

eavesdropping, while in the present work, we focus on uncoded analog transmissions over

multiple-access channels.

4.2 System Model

4.2.1 Distributed Approximation of Functions

The system model and problem statement for DFA are the same as in Section 2.2.1.

Depending on the application at hand, there are multiple ways in which the quality of

the estimate f

˜can be quantified. Besides the notion of ε-approximation of a function at

confidence level δdefined in (2.5), we also define the approximation by criterion of the

mean square error (MSE).

Definition 11. We say that fis V-MSE-approximated if, under a uniform distri-

bution of f(s1, . . . , sK), we have

E(︃(︂f

˜−f(s1, . . . , sK))︂2)︃≤V,

where the expectation is over the joint distribution of s1, . . . , sKand f

˜which is

induced by the distributed function approximation scheme and the channel.

4.2 System Model

4.2.2 Secrecy Extension to Distributed Approximation of Functions

In order to incorporate security aspects into the framework, we consider the extended

system model depicted in Fig. 4.2.

The first addition to the model is an attacker Ewhich attempts to eavesdrop on the

transmission and would like to gain knowledge about s1, . . . , sK. At each channel use, E

observes an output Zranging over the eavesdropper’s alphabet Z. As a counter-measure,

we add a friendly jammer Jwhich transmits some jamming sequence XMwith the ob-

jective to prevent Efrom obtaining information while still allowing Bto obtain a good

estimate of f(s1, . . . , sK).

Definition 12. A DFA scheme with jamming consists of:

•A distributed function approximation scheme as described in Section 2.2.1; i.e., pre-

and post-processing schemes

•A jamming strategy given by a probability distribution on XM.

We say that a distributed function approximation scheme with jamming allows reconstruc-

tion of the jamming signal with probability δif there is a decoding function ϑ:YM→ XM

such that

sup

s1∈S1,...,sK∈SK

Ps1,...,sK(︁ϑ(YM)=XM)︁≤δ

and δis the smallest number with this property.

The objective is to find admissible pre- and post-processing strategies as well as a

jamming strategy such that Bcan obtain a good approximation f

˜of f(s1, . . . , sK) while

bounding the usefulness of any information that Ecan obtain about s1, . . . , sK.

Together with the channel, a distributed function approximation scheme with jamming

induces a probability distribution R

ˆs1,...,sKon ZMfor each (s1, . . . , sK)∈ S. How secure

the scheme is depends on how strongly R

ˆs1,...,sKdepends on s1, . . . , sK. In the following,

we formalize this notion.

Any measurable function g:S → S

ˆ, where S

ˆis a measurable space, is called an

eavesdropper’s objective.

Definition 13. 1. Given a real number η≥0, we say that a distributed function ap-

proximation scheme with jamming is η-semantically secure if there is a probability

measure νon ZMsuch that for all (s1, . . . , sK)∈ S,



R

ˆs1,...,sK−ν



TV ≤η, (4.1)

4 Security in Over-the-Air Computation

where ∥µ∥TV := sup{µ(E) : Eis an event}denotes the total variation norm on signed

measures.

2. Let g:S → S

ˆ, where S

ˆ⊆Ris measurable and bounded, be an eavesdropper’s

objective. Let V≥0be a real number. We say that a distributed function approxi-

mation scheme with jamming is (g, V )-MSE-secure if under a uniform distribution

of g(s1, . . . , sK), for every estimator d:ZM→ S

ˆ, we have

E(︂(︁d(ZM)−g(s1, . . . , sK))︁2)︂≥V,

where the expectation is over the joint distribution of s1, . . . , sKand YMwhich results

from the application of the distributed function approximation scheme with jamming

and the channel.

In statistical terms, that a scheme is (g, V )-MSE-secure means that all estimators the

eavesdropper can apply have MSE at least Vunder a uniformly distributed objective. A

uniform distribution of the objective means that s1, . . . , sKare randomly distributed in

such a way that g(s1, . . . , sK) follows a uniform distribution.

In a sense made explicit by the following lemma, semantic security is the stronger of

the two security notions from Definition 13.

Lemma 11. Let S

ˆ:= [a, b], let g:S → S

ˆbe an eavesdropper’s objective and η≥0a real

number.

Then, any distributed function approximation scheme with jamming that is η-semantically

secure is also (g, (1/12 −η)(b−a)2)-MSE-secure.

Proof. Let d:ZM→ S

ˆ. Then, assuming the distribution of s1, . . . , sKcorresponds to a

uniform distribution on [a, b] of g(s1, . . . , sK), we have

Es1,...,sKER

ˆs1,...,sK(︂(︁d(ZM)−g(s1, . . . , sK))︁2)︂

=Es1,...,sK∫︂(b−a)2

ˆs1,...,sK(︃(︂d(ZM)−g(s1, . . . , sK))︂2> a)︃da

(4.1)

≥Es1,...,sK∫︂(b−a)2

0(︃ν(︂(︁d(ZM)−g(s1, . . . , sK))︁2> a)︂−η)︃da

=Es1,...,sKEν(︂(︁d(ZM)−g(s1, . . . , sK))︁2)︂−η(b−a)2

≥(︃1

12 −η)︃(b−a)2,

where the last step is because under ν,ZMis independent of s1, . . . , sK. Therefore, the

posterior distribution of g(s1, . . . , sK) given ZMis uniform on [a, b], which implies that

4.2 System Model

the minimum MSE is the variance of the uniform distribution. Denoting with Ua random

variable distributed uniformly in [a, b], we can calculate its variance as

E(︁U2)︁−(EU)2=∫︂b

b−adg −(a+b)2

=b3−a3

3(b−a)−(a+b)2

=b2+a2+ab

3−a2+ 2ab +b2

=4b2+ 4a2+ 4ab −3a2−6ab −3b2

12(b−a)2.

4.2.3 Special case K= 1

We conclude this section with a brief discussion of the important special case K= 1. While

one of the main motivations of the methods developed in this paper is their scalability to

large values of K, the case of low values of Kcan also be interesting in many practical

applications and be instructive to understand the nature of our results better.

For the special case of only a single transmitter (K= 1), the problem reduces to a

point-to-point transmission of the real number f(s1) in the presence of an eavesdropper

and a friendly jammer. In our results in this paper, there is no assumption that Khas to

be large; in particular, they remain applicable also when K= 1. However, since in this

case no function of distributed values has to be computed over the channel, it is possible

to separately source and channel encode f(s1). After the source coding step has been

performed, the remaining problem is very similar to jammer-aided secret communication

as treated for instance in [NG05,VBBM10,VBBM11].

But although this approach is applicable to the same communication task, it is important

to note that the way in which the friendly jammer has to be placed differs significantly. In

the approach of this paper, the jamming signal has to be stronger at the legitimate receiver

than it is at the eavesdropper. As long as this condition is satisfied, the legitimate receiver

has the ability to almost completely cancel the jamming signal. This means that our

method remains applicable even if the gap in terms of jammer signal strength between the

legitimate receiver and the eavesdropper is relatively small. In [NG05,VBBM10,VBBM11],

on the other hand, it is necessary that the jamming signal is stronger at the eavesdropper

than it is at the legitimate receiver. Moreover, this gap between signal strengths has to be

as large as possible since the jammer’s signal strength at the legitimate receiver diminishes

the capacity of the main channel. Therefore, our results in this case are more suitable for

4 Security in Over-the-Air Computation

scenarios where is is possible to assure a high jamming signal strength at the legitimate

receiver while results from [NG05, VBBM10, VBBM11] are more suitable in cases where

all possible eavesdropper locations can be covered with strong jamming signals that have

very low strength at the location of the legitimate receiver.

4.3 Specialization to the Additive White Gaussian Noise

Channel

In general, the approximation scheme even without an eavesdropper or jammer highly

depends on the particular structure of the channel and f. It is therefore instructive to

consider a specialization of the DFA framework with jamming to the computation of

arithmetic means over AWGN channels. Specifically, the objective function is given as

f: (s1, . . . , sK)↦→ 1

∑︂

k=1

sk,(4.2)

where for all k,Sk= [−1,1]. The channel is given by

Y=hAB

∑︂

k=1

Tk+hJBX+NB(4.3)

Z=hAE

∑︂

k=1

Tk+hJEX+NE,(4.4)

where each of the transmitters A1,...,AKis subject to a total power constraint of PA,J

is subject to an average power constraint PJ,NBis centered normal with variance σ2

Band

NEis centered normal with variance σ2

E. The real channel coefficients hAB, hJB, hAE, hJE

are assumed deterministic and known everywhere. The channel is used Mtimes with

transmitter input sequences TM

kfor each k∈ {1, . . . , K}and XMfor the jammer. The

input sequences are subject to the average power constraints

∑︂

m=1

(Tk,m)2≤PA,1

∑︂

m=1

(Xm)2≤PJ.

The problem is easily approached if we can assume that Bhas full knowledge of XM,

while Eknows only how XMis distributed.

In this case, we have the following result.

Lemma 12. Consider the wiretap channel given by (4.3) and (4.4) and the objective

4.3 Specialization to the Additive White Gaussian Noise Channel

0 1 2 3 4

0.1

0.2

0.3

0.4

σ2

σ2Ψ(2/σ)

Figure 4.3: Illustration of the MSE guarantees of Lemma 12. The dashed line is the MSE

which an eavesdropper would have without any received signal (i.e., guessing

the middle of the interval).

function fdefined in (4.2). Define

σ2

eff,B:= σ2

ABK2PA

, σ2

eff,E:= σ2

E+h2

JEPJ

AEK2PA

Assume that the jamming sequence XMis perfectly known at the legitimate receiver

while the eavesdropper has only statistical information. Define

Ψ(a) := ∫︂a

0∫︂∞

−∞ (︃c+φN(−c)−φN(a−c)

ΦN(a−c)−ΦN(−c)−b)︃21

aφN(b−c)dcdb, (4.5)

where φNdenotes the probability density function and ΦNthe cumulative distribution

function of the standard normal distribution, respectively. Then there is a distributed

function approximation scheme with jamming which is (f, σ2

eff,EΨ(2/σeff,E))-MSE-secure

and (σ2

eff,BΨ(2/σeff,B))-MSE-approximates fat the receiver.

The proof is based on a few facts from statistics. We only state the relevant lemmas

here. For the sake of completeness, we include the proofs of the following two lemmas in

Section 4.6.1.

Lemma 13. If Uis distributed uniformly in [a, b]and, conditioned on U,V1,...,VM

are i.i.d. normally distributed with mean Uand variance σ2, then the minimum MSE

4 Security in Over-the-Air Computation

estimator for estimating Ufrom the observations V1,...,VMis

ˆ:= V

¯+σ

√M·

φN(︂a−V

σ/√M)︂−φN(︂b−V

σ/√M)︂

ΦN(︂b−V

σ/√M)︂−ΦN(︂a−V

σ/√M)︂,(4.6)

where V

¯:= 1

M∑︁M

m=1 Vm.

Lemma 14. Under the assumptions of Lemma 13, the estimator U

ˆsatisfies

E(︃(︂U −U

ˆ)︂2)︃=σ2

MΨ(︃b−a

σ/√M)︃,

with Ψas defined in (4.5).

Proof of Lemma 12. We use the following transmission strategy:

Xm: Gaussian with mean 0 and variance PJ,(4.7)

k:sk↦→ (1,...,1) ·sk√︃PA

M(4.8)

The receiver can obtain

Y′

m:= Ym−hJBXm

hABK√︁PA/M

=hAB ∑︁K

k=1 Tk+NB,m

hABK√︁PA/M

=hAB ∑︁K

k=1 sk√︁PA/M +NB,m

hABK√︁PA/M

=f(s1, . . . , sK) + NB,m

hABK√︁PA/M .

We define the post-processing operation DMat the receiver as first obtaining Y′

1, . . . , Y ′

and then computing the MSE estimator from Lemma 13. With this choice, Lemma 14

yields the claimed reconstruction error guarantee.

On the other hand, the output at Eis given by

(4.4)

=hAE

∑︂

k=1

Tk+hJEXm+NE,m

(4.8)

=hAE

∑︂

k=1

sk√︁PA/M +hJEXm+NE,m

4.4 Main Results

(4.2)

=f(s1, . . . , sK)·K√︁PA/MhAE +hJEXm+NE,m.

From this, Lemmas 13 and 14 yield the claimed MSE-security of the scheme.

The assumption that the legitimate receiver has full knowledge of the jamming signal

seems quite strong. So the main part of this chapter is devoted to setting out and analyzing

a jamming strategy in Theorem 4 which does not need any form of shared randomness or

additional communication between the friendly jammer and the legitimate receiver. This

jamming strategy has “almost” the same implications on the overall system performance

as the assumption that the legitimate receiver has full knowledge of the jamming signal,

while the eavesdropper only has knowledge about its distribution. In Corollary 9, we

formalize this notion for the AWGN case. In principle, however, this jamming strategy

is not restricted to the AWGN scenario; but in fact, it can be combined with any class

of channel models in which it is possible to cancel or at least mitigate the effect of the

jamming signal if exact knowledge of it is available.

One example where not a full cancellation but at least a good mitigation is possible is

the fast-fading scenario treated in Chapter 2.

4.4 Main Results

In this section, we formally state the main results of this chapter.

A function approximation scheme is specific to a particular channel model, which among

other things influences how the pre- and post-processing operations have to be designed

as well as which class of functions can be approximated. In Section 4.3, we have described

such a scheme for the AWGN channel and only a singleton class of functions, namely

for the arithmetic average, which is a particularly simple case. The strategy for the

legitimate receiver to counter the signal of the friendly jammer given that it is known is

also particularly simple in the AWGN case and can be done perfectly, as we have seen.

The missing part of the secrecy extension, which is the method for the legitimate receiver

to obtain the necessary knowledge of the jamming signal while the eavesdropper cannot,

on the other hand, can be phrased and proven to work in somewhat greater generality. In

order to formally state our main results, therefore, we have to introduce a few technical

concepts first.

For any channel W, we denote the joint input-output distribution under the input

distribution Pby QP,W and the marginal for Yby RP,W . With these conventions, we

define the information density of tuples of elements of the input and output alphabets

4 Security in Over-the-Air Computation

under the channel Wand an input distribution Pas

iP,W (xM;yM) := log dWM(xM,·)

dRM

P,W

(yM).

Correspondingly, the mutual information is defined as

IP,W := EQP,W iP,W (X;Y).

Moreover, given two probability measures µand ν, we define the R´enyi divergence of order

α∈(0,1) ∪(1,∞) between them as

Dα(µ||ν) := 1

α−1log Eµ(︄(︃dµ

dν )︃α−1)︄.

D1(µ||ν) := limα↗1Dα(µ||ν) is the Kullback-Leibler divergence.

Acompound channel is a family (Ws)s∈S of memoryless time-discrete point-to-point

channels with common input alphabet Xand output alphabet Y. The transmitter’s

channel input is passed through a fixed Wsfor the entire block length, but the transmitter

does not control the choice of s, nor is it governed by a probability distribution. In this

work, we assume neither the transmitter nor the receiver knows s. A compound channel

code with block length Mand rate Rconsists of an encoder EM

k:{1,...,exp(MR)} →

XMand a decoder DM:YM→ {1,...,exp(MR)}. We say that it has error probability δ

if under a uniform distribution of M∈ {1,...,exp(MR)}, the following is true: Let YM

be constructed by passing the components of XM:= EM(M) independently through Ws.

Then, we have

sup

s∈S

EMPs(M=DM(YM)) ≤δ,

where δis the smallest number with this property.

Our secrecy scheme will hinge on the capacity of compound channels with possibly

continuous alphabets (such as Gaussian compound channels). As mentioned in Section 4.1,

it is shown in [Kes61] that even in the case that only the output alphabet is countably

infinite, the capacity expressions from the finite case [BBT59, Wol59] do not carry over.

It is therefore clear that an additional assumption on the compound channel is needed.

In existing literature (e.g., [BBT59, RV68]), the problem is often approached by proving

that the compound channel can be approximated by a finite class of channels in which

case classical channel coding techniques such as joint typicality decoding can be adapted

in a straightforward manner. In this work, we choose to directly pose the approximability

of the compound channel by a finite class of channels as an assumption of our coding

4.4 Main Results

theorem. In Section 4.5.1, we justify the usefulness of results involving this assumption

by proving that a large class of practically relevant channels can indeed be approximated

in the sense of the following definition.

Given measures µand ν, we say that µis absolutely continuous with respect to ν, or

µ≪ν, if all ν-null sets are also µ-null sets.

Definition 14. Given a compound channel (Ws)s∈S with input alphabet Xand output

alphabet Y, we say that it can be (η, J)-approximated under a probability distribution P

on Xif there is a sequence (W

ˆη,j)J

j=1 of channels from Xto Ysuch that for every s∈ S,

there is j∈ {1, . . . , J}such that

EPD1(︂Ws(X, ·)||W

ˆη,j(X, ·))︂≤η(4.9)

∃α > 1∀x∈ X :Dα(︂Ws(x, ·)||W

ˆη,j(x, ·))︂<∞(4.10)

∀x∈ X :W

ˆη,j(x, ·)≪Ws(x, ·) (4.11)

IP,W

ˆη,j −IP,Ws≤η, (4.12)

and for every j∈ {1, . . . , J}there is s∈ S such that

IP,Ws−IP,W

ˆη,j ≤η. (4.13)

The discussion in Section 4.5.1 provides sufficient topological conditions for (η, J)-

approximability and shows that an example of channels with this property are (possibly

fading) Gaussian channels.

We use a standard random codebook construction: Given a channel input alphabet X, a

distribution Pon X, a block length Mand a rate R, we define the (P, M, R)-ensemble of

codebooks as a random experiment in which exp(MR) codewords of length Mare drawn

randomly and independently according to Pfor each component of each codeword.

A codebook Cinduces a jamming strategy in the following way: The jammer draws a

codeword index Muniformly at random and transmits C(M), the codeword in Cindexed

by M. Therefore, the number of codewords in the codebook controls the amount of

randomness contained in the jamming signal.

In order to be able to impose an average power constraint on the jammer, we define an

additive cost constraint (c,C) for an input alphabet Xconsisting of a function c:X → R+

and a number C∈R+

0. Given any M, we say that xM∈ XMsatisfies the cost constraint

if ∑︁M

m=1 c(xm)≤MC.

The specialization of this definition to a usual average power constraint would be to

pick the square function as cand the maximum admissible average power as C.

4 Security in Over-the-Air Computation

As long as there is at least one xM∈ XMwhich satisfies the cost constraint (c,C),

given any codebook Cof block length M, we can define an associated cost-constrained

codebook Cc,Cwhich is generated from Cby replacing all codewords that do not satisfy

the cost constraint with xM. Obviously, all codewords in a cost-constrained codebook

satisfy the cost constraint. We say that a cost constraint (c,C) is compatible with an input

distribution Pif for a random variable Xdistributed according to P,c(X) has a finite

moment generating function in an interval containing 0 in its interior and C>EPc(X).

We assume a given pre-processing scheme which is admissible in the sense of Definition 9

and consider effective channels incorporating both the pre-processing at the transmitters

and the physical channel. We denote the legitimate user’s effective channel, which is a

stochastic kernel mapping from S1×···×SK×X to Y, by WBand the eavesdropper’s

effective channel, which is a stochastic kernel mapping from S1×···×SK×X to Z, by

WE. The M-fold products of these effective channels are outlined in the system model

in Fig. 4.2. With these concepts and notations defined, we are ready to state the main

result of this work, which gives sufficient conditions for the existence of a jamming scheme

that can simultaneously ensure that the legitimate receiver is able to reconstruct the full

jamming signal and limit the usefulness of the eavesdropper’s received signal.

Theorem 4. Let Pbe a jammer input distribution. Suppose that for every η > 0,

there is some J(η)such that the compound channel (Ws)s∈S defined by W(s1,...,sK):=

WB(s1, . . . , sK,·,·)can be (η, J(η))-approximated under P. Suppose further that for all

s1∈ S1, . . . , sK∈ SK, the moment-generating function

Eexp(a·iP,WE(s1,...,sK,·,·)(X;Z))

of the information density exists and is finite at some point a > 0. Let (c,C) be an additive

cost constraint compatible with P, and let Cbe a random codebook from the (P, M, R)-

ensemble. Let R ∈ (0,∞)such that

sup

s1∈S1,...,sK∈SK

IP,WE(s1,...,sK,·,·)<R<inf

s1∈S1,...,sK∈SK

IP,WB(s1,...,sK,·,·).(4.14)

Then there are numbers γ1, γ2, γ3, γ4>0such that for sufficiently large M,

PC(︃



R

ˆWE(s1,...,sK,·,·)M,Cc,C−RM

P,WE(s1,...,sK,·,·)



TV ≥exp(−Mγ1))︃

<exp(−exp(Mγ2)) (4.15)

and

PC(E)<exp(−Mγ4),(4.16)

4.4 Main Results

where Eis the event that the jamming strategy induced by Cc,Cdoes not allow reconstruction

of the jamming signal with error at most exp(−Mγ3).

Obviously, the theorem is only useful if there exists some Rsatisfying (4.14). The

condition that such an Rexists is the formalization of the notion that the jamming signal

has to be stronger at the legitimate receiver than it is at the eavesdropper.

In Section 4.5.2, we discuss in more detail how the guarantee in (4.15) can be used to

arrive at a MSE security guarantee for the scheme.

Theorem 4 needs a compound channel coding result as an ingredient for its proof, and

since this result is slightly more general than results available in the literature, it may be

of independent interest. Therefore, we also state it in this section.

Theorem 5. Let (Ws)s∈S be a compound channel with input alphabet Xand output al-

phabet Y, and let Pbe a probability distribution on Xsuch that for every η > 0, there is

aJ(η)such that (Ws)s∈S can be (η, J(η))-approximated under P. Let

0<R<inf

s∈S IP,Ws,(4.17)

and let Cbe a random codebook from the (P, M, R)-ensemble. Define an encoder m↦→

C(m). Then there is a decoder such that the average error probability δof the resulting

compound channel code satisfies

EC(δ)<exp(−Mγ),(4.18)

for some γ > 0and sufficiently large M.

With standard techniques, this theorem can be extended to the case of cost-constrained

codebooks. We provide the full details of the proof of the following corollary in Sec-

tion 4.6.5.

Corollary 7. In the setting of Theorem 5, and given an additive cost constraint (c,C)

compatible with P, there are γ1, γ2>0such that for sufficiently large M,

PCc,C(︁δ≥exp(−Mγ1))︁<exp(−Mγ2).(4.19)

As the other main technical ingredient, we need the following result on channel resolv-

ability.

Theorem 6. Given a channel Wfrom Xto Y, an input distribution Psuch that the

moment-generating function EQP,W exp(a·iP,W (X;Y)) of the information density exists

4 Security in Over-the-Air Computation

and is finite for some a > 0, and R>IP,W , there exist γ1>0and γ2>0such that for

large enough block lengths M, the (P, M, R)-ensemble satisfies

PC(︂∥R

ˆW,C−QM

P,W ∥TV >exp(−γ1M))︂≤exp (−exp (γ2M)) ,(4.20)

where R

ˆW,Cis the output distribution of channel Wgiven that a uniformly random code-

word from Cis transmitted.

Similarly as with the compound channel coding theorem, we can use known methods to

incorporate an additive cost constraint and argue the following corollary. For full details,

we refer the reader to Section 4.6.5.

Corollary 8. Let Pbe an input distribution on Xand (c,C)an additive cost constraint

compatible with P. Then the statement of Theorem 6 is valid even if the codebook Cis

replaced with its associated cost-constrained version Cc,C.

4.5 Implications of the Main Results

In this section, we show that the main result on secure OTA computation, Theorem 4,

implies a MSE security guarantee in the case of AWGN channels discussed in Section 4.3.

To this end, we show in Section 4.5.1 that AWGN compound channels satisfy (among

other channel models) the approximability criterion of Definition 14. In Section 4.5.2, we

show how Theorem 4 can be applied to carry the MSE security result of Lemma 12 over

to the case in which the legitimate receiver does not share randomness with the jammer.

4.5.1 Feasibility of Channel Approximation

In this subsection, we provide some tools and examples to argue that many compound

channels of practical interest can indeed be (η, J)-approximated so that Theorem 5 may

be applied to them. We begin with an observation that shows how our result specializes

to the known results [BBT59,Wol59] for channels with finite alphabets.

Remark 5. [BBT59, Lemma 4] implies that for every compound channel (Ws)s∈S with

finite input and output alphabets and every η > 0, there is an integer J(η)such that

(Ws)s∈S can be (η, J(η))-approximated.

We repeat the construction here and discuss how this fact is proved.

Let Mbe an integer which satisfies

M≥max (︃4|Y|3

η2,2|Y|2

η)︃.

4.5 Implications of the Main Results

Given s∈ S, we construct a channel W′

s. To this end, given any x∈ X, we fix an

enumeration (yi)|Y|

i=1 such that the finite sequence (Ws(x, {yi}))|Y|

i=1 is nondecreasing. For

every i < |Y|, we can then uniquely choose a value for W′

s(x, {yi})such that it is an integer

multiple of 1/M and

Ws(x, {yi})≤W′

s(x, {yi})< Ws(x, {yi}) + 1

M.(4.21)

It is argued in [BBT59] that this leaves a positive probability mass for W′

s(x, {y|Y|})and

therefore, this construction fully defines a channel W′

s. We define the approximation se-

quence (W

ˆη,j)J(η)

j=1 as an enumeration of the set {W′

s:s∈ S}. The cardinality of this set

is upper bounded by (M+ 1)|X||Y| since all singleton probabilities are integer multiples of

1/M.

For finite alphabets, (4.10) is trivially satisfied since R´enyi divergence is in this case

always finite [vEH14]. Regarding the absolute continuity criterion (4.11), we recall that

W′

s(x, {y|Y|})always has a positive probability, and for i < |Y|, the assumption Ws(x, {yi}) =

0immediately implies W′

s(x, {y|Y|}) = 0 by (4.21), since 0is the only integer multiple of

1/M which is strictly smaller than 1/M. The proof in [BBT59] exploits (4.21) to prove

that the absolute difference between the information of Wsand W′

sunder any input dis-

tribution is at most 2|Y|3/2M−1/2(statement (c) of the lemma) which by our choice of M

immediately implies (4.12) and (4.13). Moreover, it is shown that (4.21) also implies that

for all x∈ X, y ∈ Y,

log Ws(x, {y})

W′

s(x, {y})≤2|Y|2

(statement (b) of the lemma) which by our choice of Mimplies (4.9).

For many channels of interest, (η, J(η))-approximability can be shown directly by going

through properties (4.9) – (4.13). However, it is often easier to make an argument involving

topological properties of S. The following lemma provides some machinery to this end.

Lemma 15. Let (Ws)s∈S be a compound channel with input alphabet Xand output al-

phabet Y, let Pbe a probability distribution on Xand assume that there is a topology on

Ssuch that Sis compact and

∀s0∈ S :s↦→ EPD1(Ws(X, ·)||Ws0(X, ·)) is upper semi-continuous at s0(4.22)

∀s1, s2∈ S ∃α > 1∀x∈ X :Dα(Ws1(x, ·)||Ws2(x, ·)) <∞(4.23)

s↦→ IP,Wsis lower semi-continuous. (4.24)

Then, for every η > 0, there is J(η)such that (Ws)s∈S can be (η, J(η))-approximated

under P.

4 Security in Over-the-Air Computation

Proof. Fix some η > 0. For a given s∈ S, consider

{s′:EPD1(Ws′(X, ·)||Ws(X, ·)) < η}∩{s′:IP,Ws−IP,Ws′< η}.

Clearly, (4.22) and (4.24) ensure that this intersection is a neighborhood of s, so we

can find an open neighborhood Dscontained in it. Thus, (Ds)s∈S is an open cover of

Sand therefore, the compactness of Syields a finite subcover Ds1,...,DsJ(η). We set

ˆη,j := Wsjand given any s∈ S, we choose jsuch that s∈ Dsjand argue that W

ˆη,j

satisfies (4.9), (4.10) and (4.12). To this end, we note that (4.10) and (4.11) follow from

(4.23), while (4.9) and (4.12) are ensured by the definition of Dsj. Finally, (4.13) is trivially

satisfied, concluding the proof.

We now make use of Lemma 15 to prove that a large class of Gaussian fading multiple-

input and multiple-output channels can actually be (η, J(η))-approximated and thus The-

orem 5 can be applied to them. The class of compound channels covered in the following

theorem contains the class considered in [RV68, Sections 3 and 4] as a proper subset. We

denote the set of symmetric, positive semidefinite n×n-matrices with Symn

+and the set

of symmetric, positive definite n×n-matrices with Symn

++.

Theorem 7. Let X=Rn2,Y=Rn1, let Sbe a compact subset of Rn1n2×Symn1n2

+×Rn1×

Symn1

++ (under the topology induced by the Frobenius norm). For any s= (µH,ΣH, µN,ΣN)∈

S, let Wsbe the channel given by

Y=HX +N,

where the channel input Xhas range Rn2, the channel output Yhas range Rn1, the entries

of the n1×n2fading matrix Hfollow the distribution N(µH,ΣH)and the additive noise

Nis independent of Hand follows the distribution N(µN,ΣN). Let Pbe a distribution on

Xand assume that either Pis a multivariate Gaussian with positive definite covariance

matrix or that the support of Pis contained in some compact set. Then, given any η > 0,

there is J(η)such that (Ws)s∈S can be (η, J(η))-approximated under P.

Proof. We show that the conditions of Lemma 15 are met. [Gil11] provides closed-form

expressions for R´enyi and Kullback-Leibler divergences between multivariate normal dis-

tributions. The only fact that we are going to use and which is apparent from these

expressions, however, is that the R´enyi and Kullback-Leibler divergences between two

multivariate normal distributions are finite and continuous in the mean vectors and co-

variance matrices of the distributions wherever the covariance matrices are positive definite

or, equivalently, both distributions are absolutely continuous with respect to the Lebesgue

measure.

4.5 Implications of the Main Results

ΣN∈Symn1

++ and therefore, given any x∈ X,Ws(x, ·) is absolutely continuous with

respect to the Lebesgue measure and thus has a positive definite covariance matrix and a

density rWs(x,·), which implies (4.23).

Next, from the well-known closed-form expression of the multivariate normal density, we

know that for any xand y,rWs(x,·)(y) is continuous in s. The boundedness of Simplies a

uniform upper bound on rWs(x,·)(y), so we can use the theorem of dominated convergence

to argue that the marginal density rRP,Ws(y) = EPrWs(X,·)(y) depends continuously on s

for any fixed y. We write

IP,Ws=EPRP,Ws(︃rWs(X,·)(Y)

rRP,Ws(Y)log rWs(X,·)(Y)

rRP,Ws(Y))︃.

Since the integrand is lower bounded by −1/e, (4.24) follows as an application of Fatou’s

lemma.

Finally, in order to argue (4.22), we distinguish between the two cases in the statement

of the theorem.

First, suppose that there is a compact subset X

ˆ⊆ X with P(X \X

ˆ) = 0. For any fixed

s0, the map

(s, x)↦→ D1(Ws(x, ·)||Ws0(x, ·))

is continuous, therefore the image of S × X

ˆis compact and hence bounded. We can

therefore invoke the theorem of dominated convergence and argue that (4.22) is satisfied.

Now, suppose that Pis multivariate Gaussian with positive definite covariance matrix.

We write

EPD1(Ws(X, ·)||Ws0(X, ·)) = EPEWs(X,·)log rP(X)rWs(X,·)(Y)

rP(X)rWs0(X,·)(Y)

=EQP,Wslog rQP,Ws(X, Y )

rQP,Ws0(X, Y )

=D1(︁QP,Ws||QP,Ws0)︁.

From our arguments above, given any s, the distribution QP,Wsis multivariate Gaussian

with positive definite covariance matrix, which implies that (4.22) is satisfied.

4.5.2 Back to the Additive White Gaussian Noise case: Calculating Mean

Square Error Security Guarantees

Revisiting the AWGN example from Section 4.3, Theorem 4 and Lemma 12 imply MSE

security and reconstruction guarantees in case the legitimate receiver does not know the

4 Security in Over-the-Air Computation

jamming sequence, as we show in the following corollary.

Corollary 9. Make the same assumptions and definitions as in Lemma 12, but do not

assume that the legitimate receiver has knowledge of the jamming sequence X1, . . . , XM.

Assume in addition that the channel from Jto Bis stronger than the channel from Jto E,

i.e., hJB/σB> hJE/σE. Then there is a distributed approximation scheme with jamming

and there are constants γ1, γ2>0such that for sufficiently large M, the following hold:

•Bcan approximate the objective function f(s1, . . . , sK)with a MSE not exceeding

σ2

eff,BΨ(︄2

σ2

eff,B)︄+ exp(−Mγ1) (4.25)

•The scheme is (f, V )-MSE-secure, where

V:= σ2

eff,EΨ(︄2

σ2

eff,E)︄−exp(−Mγ2).(4.26)

Proof. For the pre-processing at the transmitters, we use the same scheme as in the proof

of Lemma 12 and begin by verifying that the resulting effective channels WBand WEwith

the input distribution Pchosen to be Gaussian with mean 0 and variance PJsatisfy the

assumptions of Theorem 4. Since the defined compound channel is a class of Gaussian

channels with different means taking values in the compact set [−1,1], the approximability

of the channel is an immediate consequence of Theorem 7. The finiteness of the moment-

generating function of the information density can be seen by straightforward applications

of the definitions of information density and R´enyi divergence:

Eexp(a·iP,WE(s1,...,sK,·,·)(X;Z))

=E(︃(︃dWE(s1, . . . , sK, X, ·)

dRP,WE(s1,...,sK,·,·)

(Z))︃a)︃

= exp (︃a·1

alog E(︃(︃dWE(s1, . . . , sK, X, ·)

dRP,WE(s1,...,sK,·,·)

(Z))︃a)︃)︃

= exp (︁aDa+1 (︁QP,WE(s1,...,sK,·,·)||PRP,WE(s1,...,sK,·,·))︁)︁

The R´enyi divergence appearing at the end is between two multivariate Gaussian distri-

butions and can be seen to be finite from the expressions given in [Gil11]. In order to

verify (4.14), we first note that the information expressions appearing are the capacities

of the effective channels WBand WE. Since s1, . . . , sKchange the mean of the channel

only, they do not influence the capacity. Therefore, the infimum and supremum are over

4.6 Proofs

singleton sets. Consequently, the condition hJB/σB> hJE/σEensures that there is some

Rsatisfying (4.14).

Fix γ1′, γ3′as claimed to exist in Theorem 4, and also fix γ1, γ2with 0 < γ2< γ1′and

0< γ1< γ3′.

Note that in the AWGN channel, s1, . . . , sKcorrespond to a shift of the output dis-

tribution of the channel, and therefore, the variational distance that appears in (4.15) is

independent of s1, . . . , sK. For sufficiently large M, we can therefore fix a codebook C

from the (P, M, R)-ensemble such that for all s1, . . . , sK, neither one of the error events

described in (4.15) and (4.16) occurs.

Let the jamming strategy be induced by CC,cand let d:ZM→[−1,1] be an estimator

for E. We can now bound the MSE of d:

ˆWE(s1,...,sK,·,·)M(︂(︁d(ZM)−f(s1, . . . , sK))︁2)︂

=∫︂4

ˆWE(s1,...,sK,·,·)M(︂(︁d(ZM)−f(s1, . . . , sK))︁2> a)︂da

(4.15)

≥∫︂4

0(︃RM

P,WE(s1,...,sK,·,·)(︂(︁d(ZM)−f(s1, . . . , sK))︁2> a)︂−exp(−Mγ1′))︃da

=ERM

P,WE(s1,...,sK,·,·)(︂(︁d(ZM)−f(s1, . . . , sK))︁2)︂−4 exp(−Mγ1′).

Taking the lower bound for the MSE under RM

P,WE(s1,...,sK,·,·)from Lemma 12 and noting

γ2< γ1′, we arrive at the expression in (4.26) for sufficiently large M.

For the reconstruction strategy at B, we first let Breconstruct the jamming signal as

is possible by Theorem 4 and then post-process the received signal as is possible with

knowledge of the jamming signal by Lemma 12. Using the error bound in Lemma 12 and

observing that the maximum instantaneous square error is 4 since we are constrained to

an interval of length 2 and that γ1< γ3′, for sufficiently large Mwe arrive at (4.25).

4.6 Proofs

In this section, we prove the Lemmas used in the proof of Lemma 12 and our main results

on secure OTA computation and compound channel coding, Theorems 4 and 5, as well as

the corollaries that allow for the incorporation of an average cost constraint.

4.6.1 Statistical Preliminaries for the Proof of Lemma 12

In this subsection, we prove the two lemmas used for the proof of Lemma 12.

Proof of Lemma 13. It is known [Jay03, eq. (6.92)] that the MSE is minimized by the

4 Security in Over-the-Air Computation

mean of the posterior probability distribution. We can therefore calculate the minimum

MSE estimator given the observations v1, . . . , vMas follows, where we use rwith random

variables in the index to denote (conditional) densities.

ˆ=∫︂b

grU|V1,...,VM(g|v1, . . . , vM)dg

(a)

=∫︂b

grV1,...,VM|U(v, . . . , vM|g)rU(g)

rV1,...,VM(v1, . . . , vM)dg

=∫︁b

agrV1,...,VM|U(v1, . . . , vM|g)rU(g)dg

∫︁b

arV1,...,VM|U(v1, . . . , vM|g)rU(g)dg

(b)

=∫︁b

agexp (︂−1

2σ2∑︁M

m=1(vm−g)2)︂dg

∫︁b

aexp (︂−1

2σ2∑︁M

m=1(vm−g)2)︂dg

=∫︁b

agexp (︂−1

2σ2/M (︂1

M∑︁M

m=1 v2

m−2gv¯ + g2)︂)︂dg

∫︁b

aexp (︂−1

2σ2/M (︂1

M∑︁M

m=1 v2

m−2gv¯ + g2)︂)︂dg

(c)

=∫︁b

agexp (︂−1

2σ2/M (v¯−g)2)︂dg

∫︁b

aexp (︂−1

2σ2/M (v¯−g)2)︂dg

For (a), we have applied Bayes’ rule. (b) is by observing that rU(g)=1/(b−a) is

independent of gin [a, b] and rV1,...,VM|U is the normal density. (c) is by multiplying

exp (︄−1

2σ2/M (︄v¯2−1

∑︂

m=1

m)︄)︄

on both sides of the fraction to complete the binomials.

The term we have calculated for U

ˆis the mean of a normal distribution centered at v¯ with

variance σ2/M truncated in [a, b]. This is a distribution with a known mean [JKB94, eq.

13.134], and hence we arrive at (4.6).

Proof of Lemma 14. Based on the representation (4.6), we calculate the MSE as follows.

We use the substitution rule, substituting v′:= v¯−a

σ/√Min (a) and g′:= g−a

σ/√Min (b).

E(︃(︂U −U

ˆ)︂2)︃=∫︂b

a∫︂∞

−∞ ⎛

⎝v¯ + σ

√M·

φN(︂a−v¯

σ/√M)︂−φN(︂b−v¯

σ/√M)︂

ΦN(︂b−v¯

σ/√M)︂−ΦN(︂a−v¯

σ/√M)︂−g⎞

⎠

·1

b−a·1

σ/√MφN(︃g−v¯

σ/√M)︃dv¯dg

4.6 Proofs

(a)

=∫︂b

a∫︂∞

−∞ ⎛

⎜

⎝

√M⎛

⎝v′+

φN(−v′)−φN(︂b−a

σ/√M−v′)︂

ΦN(︂b−a

σ/√M−v′)︂−ΦN(−v′)⎞

⎠+a−g⎞

⎟

⎠

·1

b−a·φN(︃g−a

σ/√M−v′)︃dv′dg

(b)

=∫︂b−a

σ/√M

0∫︂∞

−∞ ⎛

⎝v′+

φN(−v′)−φN(︂b−a

σ/√M−v′)︂

ΦN(︂b−a

σ/√M−v′)︂−ΦN(−v′)−g′⎞

⎠

·(︃σ

√M)︃3

·1

b−a·φN(g′−v′)dv′dg′

=σ2

MΨ(︃b−a

σ/√M)︃,

concluding the proof of the lemma.

4.6.2 Proof of Theorem 4

In order to prove Theorem 4, we decompose the system depicted in Fig. 4.2 into smaller

(and more easily analyzed) subsystems by considering only a subset of the depicted ter-

minals at a time.

1. Considering the terminals A1,...,AK,B.This is the system summarized in Sec-

tion 2.2.1. The rationale is that the results specialize to the setting in Section 4.3

as well as, e.g., to the fast-fading setting treated in Chapter 2. This part of the

system consists of transmitters (Ak)K

k=1 each of which holds a value sk∈ Skand a

receiver Bwhich has the objective of estimating f(s1, . . . , sK). To this end, each

transmitter Akpasses skthrough a pre-processor Ekindependently Mtimes yielding

a sequence TM

kof channel inputs. These are transmitted through Mindependent

uses of the channel, generating a sequence YMof channel outputs. The receiver

passes this sequence through a post-processor DMwhich generates an approxima-

tion f

˜of f(s1, . . . , sK). As mentioned, the design of the pre- and post-processors

depends heavily on the channel model and a particular class of functions f. The

idea is that the pre-processors, the channel and the post-processor work together to

mimic the function f, and any approach following this idea will be highly dependent

on the particular structure of the channel and f. In Theorem 4, it is assumed that

such a system is already in place and an augmentation is proposed which makes it

more secure. A property of the system described in Section 2.2.1 necessary for our

purposes and heavily exploited in this work is that the pre-processing is i.i.d., i.e.,

4 Security in Over-the-Air Computation

each pre-processor Ekis a stochastic kernel mapping from Skto Tkand an M-fold

product EM

kof it is used to generate the channel input sequence.

2. Considering the terminals A1,...,AK,J,E.In this setting, we assume that the

transmitters A1,...,AKrun a scheme of the kind described under item 1. Instead

of the legitimate receiver, there is now an eavesdropper E. The objective is then to

limit the usefulness of the eavesdropper’s received signal ZM. To this end, we add a

friendly jammer Jto the system which transmits, according to a certain strategy, a

word XM. In this work, any jamming strategy we consider is induced by a codebook

Cof words of length Mthrough the rule that the jammer chooses an element of the

codebook uniformly at random and transmits it. We use existing results on channel

resolvability to derive a bound on the usefulness of the signal ZMreceived at E.

3. Considering the terminals A1,...,AK,J,B.This is the setting from item 1 with an

additional transmitter J. Here we assume that Juses a jamming strategy induced by

a codebook Cas described under item 2 and use Theorem 5 on compound channel

coding to argue that for suitable choices of C,Bis able to fully reconstruct the

jamming signal XM. This enables Bto perform a cancellation of the jamming signal

before it applies the post-processor DMit would use in the setting of item 1. How

this cancellation works depends on the particularities of the channel considered, but

if, e.g., the jamming signal is simply added to the channel output as in the AWGN

example in Section 4.3, it is possible to cancel it entirely by subtracting it from the

received signal. So in this case the post-processor would consist of a reconstruction

of the jamming signal, the subtraction of this signal from the received one and a

post-processing step identical to that from item 1.

4. Combining settings of item 2 and 3. The goal here is to argue the existence of

a codebook Cwhich achieves both of the objectives described under item 2 and

item 3. It will turn out that this can be achieved by a standard random codebook

construction.

The main result of this work, Theorem 4, formulates conditions under which there

are codebooks in the (P, M, R)-ensemble of which the (c,C)-cost constrained versions

simultaneously achieve the goals set forth under 2) and 3).

Proof of Theorem 4. An application of Corollary 7 yields (4.16), and (4.15) follows from

Corollary 8.

4.6 Proofs

4.6.3 Proof of Theorem 5

We first pick parameters η,ε,β1and β2in sequence according to the following scheme,

where (4.17) and the previous choices ensure that these intervals are all nonempty.

η∈(︃0,infs∈S IP,Ws−R

3)︃(4.27)

ε∈(︃2η, inf

s∈S IP,Ws−R−η)︃(4.28)

β1∈(η, ε −η) (4.29)

β2∈(0, ε −η−β1) (4.30)

Fix a sequence (W

ˆη,j)J(η)

j=1 which (η, J(η))-approximates (Ws)s∈S.

We use a joint typicality decoder, i.e. if there is a unique msuch that

∃j∈ {1, . . . , J(η)}:iP,W

ˆη,j (C(m); YM)≥M(IP,W

ˆη,j −ε),

the decoder declares that message mhas been sent; otherwise it declares an error (or that

message 1 has been sent).

We denote the transmitted message with M, the message declared by the decoder with

ˆand define error events

E:= {M=M

ˆ}(4.31)

E1:= {︂∀j∈ {1, . . . , J(η)}iP,W

ˆη,j (C(M); YM)< M(IP,W

ˆη,j −ε)}︂(4.32)

E2:= {︂∃m=M∃j∈ {1, . . . , J(η)}iP,W

ˆη,j (C(m); YM)≥M(IP,W

ˆη,j −ε)}︂.(4.33)

We note that E ⊆ E1∪E2and consequently

P(E)≤P(E1) + P(E2).(4.34)

So we can bound these two errors separately and then combine them.

We start with bounding the expectation of the first summand, using the definition (4.32)

and C, as well as an addition of zero. Pick jsuch that W

ˆη,j satisfies (4.9) – (4.12) with

4 Security in Over-the-Air Computation

respect to the realization Wsof the compound channel. Then we have

EC(P(E1)) ≤EC(︂P(︂iP,W

ˆη,j (C(M); YM)< M(IP,W

ˆη,j −ε))︂)︂

=QM

P,Ws(︂iP,W

ˆη,j (XM;YM)< M(IP,W

ˆη,j −ε))︂

=QM

P,Ws(︄M

∑︂

m=1

log (︄dW

ˆη,j(Xm,·)

dRP,W

ˆη,j

(Ym))︄< M(IP,W

ˆη,j +IP,Ws−IP,Ws−ε))︄

(4.35)

The Radon-Nikodym derivative can be split as

ˆη,j(Xm,·)

dRP,W

ˆη,j

=dW

ˆη,j(Xm,·)

dWs(Xm,·)·dRP,Ws

dRP,W

ˆη,j ·dWs(Xm,·)

dRP,Ws

.(4.36)

This is possible because W

ˆη,j(x, ·)≪Ws(x, ·) by (4.11), RP,Ws≪RP,W

ˆη,j by (4.9) and the

joint convexity of Kullback-Leibler divergence in its arguments, and Ws(x, ·)≪RP,Wsfor

P-almost all xby the properties of the marginalization.

We next bound tail probabilities corresponding to the three factors in (4.36) separately,

starting with the first. To this end, we introduce a number α1>1 and argue, using

Markov’s inequality and the definition of R´enyi divergence, that

P,Ws(︄M

∑︂

m=1

log dWs(Xm,·)

ˆη,j(Xm,·)(Ym)≥Mβ1)︄

=QM

P,Ws(︄exp (︄(α1−1)

∑︂

m=1

log dWs(Xm,·)

ˆη,j(Xm,·)(Ym))︄≥exp((α1−1)Mβ1))︄

≤EQM

P,Ws⎛

⎝⎛

⎝

∏︂

m=1 (︄dWs(Xm,·)

ˆη,j(Xm,·)(Ym))︄α1−1⎞

⎠⎞

⎠exp(−(α1−1)Mβ1)

= exp ⎛

⎝

∑︂

m=1

log ⎛

⎝EQM

P,Ws⎛

⎝(︄dWs(Xm,·)

ˆη,j(Xm,·))︄α1−1⎞

⎠⎞

⎠exp(−(α1−1)Mβ1)

= exp (︂−(α1−1)M(︂β1−EPDα1(︂Ws(X, ·)||W

ˆη,j(X, ·))︂)︂)︂.(4.37)

For the second factor, we argue in an analogous way, but using α2>0.

4.6 Proofs

P,Ws(︄M

∑︂

m=1

log dRP,W

ˆη,j

dRP,Ws

(Ym)≥Mβ2)︄

=RM

P,Ws(︄exp (︄α2

∑︂

m=1

log dRP,W

ˆη,j

dRP,Ws

(Ym))︄≥exp(α2Mβ2))︄

≤ERM

P,Ws(︄M

∏︂

m=1 (︄dRP,W

ˆη,j

dRP,Ws

(Ym))︄α2)︄exp(−α2Mβ2)

= exp (︂(α2−1)MDα2(︂RP,W

ˆη,j ||RP,Ws)︂−α2Mβ2)︂.(4.38)

Finally, for the third factor, we use α3<1.

P,Ws(︂iP,Ws(XM;YM)< M(IP,Ws−ε+β1+β2+η))︂

=QM

P,Ws(︁exp (︁(α3−1)iP,Ws(XM;YM))︁>exp ((α3−1)M(IP,Ws−ε+β1+β2+η)) )︁

≤EQM

P,Ws(︄M

∏︂

m=1 (︃dWs(Xm,·)

dRP,Ws

(Ym))︃α3−1)︄exp (−(α3−1)M(IP,Ws−ε+β1+β2+η))

= exp (︂−(1 −α3)M(︁Dα3(QP,Ws||PRP,Ws) + ε−IP,Ws−β1−β2−η)︁)︂,(4.39)

Clearly, by (4.36), the union bound and (4.12), (4.35) is upper bounded by the sum of

(4.37), (4.38) and (4.39). Next, we argue that these expressions all vanish exponentially

with M→ ∞, using the continuity of R´enyi divergence in the order which is shown

in [vEH14, Theorem 7].

From (4.10), the theorem of monotone convergence and (4.9), we can conclude that

lim

α1↘1

EPDα1(︂Ws(Xm,·)||W

ˆη,j(Xm,·))︂=EPD1(︂Ws(Xm,·)||W

ˆη,j(Xm,·))︂≤η,

so, (4.29) allows us to fix α1at a value greater than 1 such that

β1−EPDα1(︂Ws(Xm,·)||W

ˆη,j(Xm,·))︂>0

and hence, (4.37) vanishes exponentially.

(4.38) is true for all α2<1. Since the inequalities are not strict, we can take the limit

α2↗1 and argue that the statement is also valid for α2= 1.

Dα3(QP,Ws||PRP,Ws) converges to IP,Wsfrom below for α3↗1 and so (4.30) allows us

4 Security in Over-the-Air Computation

to fix α3at a value less than 1 such that

Dα3(QP,Ws||PRP,Ws) + ε−IP,Ws−β1−β2−η > 0

and therefore, (4.39) also vanishes exponentially.

For the second summand in (4.34), we use the definition (4.33) to argue that EC(P(E2))

is upper bounded by

exp(MR)

J(η)

∑︂

j=1

PMRM

P,Ws(︃iP,W

ˆη,j (XM;YM)≥M(︂IP,W

ˆη,j −ε)︂)︃.(4.40)

We define the indicator function

ind(xM, yM) := ⎧

⎨

⎩

1,iP,W

ˆη,j (xM;yM)≥M(︂IP,W

ˆη,j −ε)︂

0,otherwise.

Using the definition of information density for a change of measure and multiplying one,

we rewrite the probability that appears in (4.40) as

PMRM

P,Ws(︃iP,W

ˆη,j (XM;YM)≥M(︂IP,W

ˆη,j −ε)︂)︃

=∫︂

XM×YM

ind(xM, yM)·PMRM

P,Ws(dxM, dyM)

=∫︂

XM×YM

exp (︁−iP,Ws(xM;yM))︁ind(xM, yM)QM

P,Ws(dxM, dyM)

=∫︂

XM×YM

exp (︁−iP,Ws(xM;yM) + iP,W

ˆη,j (xM;yM)−iP,W

ˆη,j (xM;yM))︁

·ind(xM, yM)QM

P,Ws(dxM, dyM)

Because of the presence of the indicator, we can uniformly bound

iP,W

ˆη,j (xM;yM)≥M(︂IP,W

ˆη,j −ε)︂

4.6 Proofs

and the indicator itself can be upper bounded by 1. This yields

PMRM

P,Ws(︃iP,W

ˆη,j (XM;YM)≥M(︂IP,W

ˆη,j −ε)︂)︃

≤exp (︂−M(︂IP,W

ˆη,j −ε)︂)︂

∫︂

XM×YM

exp (︂−iP,Ws(xM;yM) + iP,W

ˆη,j (xM;yM))︂QM

P,Ws(dxM, dyM)

We expand the definition of information density and apply Fubini’s Theorem to rewrite

the integral as

∫︂

⎛

⎝∫︂

ˆη,j

M(xM,·)

dRM

P,W

ˆη,j

(yM)PM(dxM)⎞

⎠RM

P,Ws(dyM)

and observe that it equals 1.

Combining with (4.40) and applying (4.13), we obtain

EC(P(E2)) ≤exp(MR)

J(η)

∑︂

j=1

exp (︂−M(︂IP,W

ˆη,j −ε)︂)︂

≤exp(MR)

J(η)

∑︂

j=1

exp (︃−M(︃inf

s∈S IP,Ws−ε−η)︃)︃

= exp (︃−M(︃inf

s∈S IP,Ws−ε−R−η−log J(η)

M)︃)︃.(4.41)

We observe that by (4.28), infs∈S IP,Ws−ε−R−η > 0.

Finally, we pick

γ∈(︄0,min (︃(α1−1) ·(︂β1−EPDα1(︂Ws(Xm,·)||W

ˆη,j(Xm,·))︂)︂,

β2,

(1 −α3)(︁Dα3(QP,Ws||PRP,Ws) + ε−IP,Ws−β1−β2−η)︁,

inf

s∈S IP,Ws−ε−R−η)︃)︄.

Since the exponent in (4.41) is then negative for sufficiently large M, we can combine it

with (4.37), (4.38) and (4.39) to obtain (4.18).

4 Security in Over-the-Air Computation

4.6.4 Proof of Theorem 6

In order to prove the theorem, given a codebook C, we write the variational distance as

∥R

ˆW,C−RM

P,W ∥TV = sup

A⊆XM

measurable (︂R

ˆW,C(A)−RM

P,W (A))︂

= sup

A⊆XM

measurable ∫︂A(︄dR

ˆW,C

dRM

P,W

(yM)−1)︄RM

P,W (dyM)

=ERM

P,W [︄dR

ˆW,C

dRM

P,W

(yM)−1]︄+

.(4.42)

Note that throughout the proofs, we only consider codebooks Cfor which R

ˆW,Cis abso-

lutely continuous with respect to RM

P,W . We can do this because the existence of a finite

mutual information implies that W(x, ·) is absolutely continuous with respect to RP,W

for almost every x, and so the probability of drawing a codebook for which R

ˆW,Cis not

absolutely continuous with respect to RM

P,W is 0. Similarly, we assume the existence of the

other Radon-Nikodym derivatives that appear.

We define the typical set

Tε:= {︃(xM, yM) : 1

MiP,W (xM;yM)≤IP,W +ε}︃(4.43)

and split R

ˆW,Cinto two measures

ˆ1,W,C(A) := exp(−MR)

exp(MR)

∑︂

m=1

WM(︁C(m), A ∩{yM: (C(m), yM)∈Tε})︁(4.44)

ˆ2,W,C(A) := exp(−MR)

exp(MR)

∑︂

m=1

WM(︁C(m), A ∩{yM: (C(m), yM)/∈Tε})︁.(4.45)

We observe R

ˆW,C=R

ˆ1,W,C+R

ˆ2,W,C, which allows us to split (4.42) into a typical and

an atypical part

∥R

ˆW,C−RM

P,W ∥TV =ERM

P,W [︄dR

ˆ1,W,C

dRM

P,W

(yM) + dR

ˆ2,W,C

dRM

P,W

(yM)−1]︄+

≤ERM

P,W [︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+

ˆ2,W,C(YM).(4.46)

We next state and prove two lemmas that we will use as tools to bound the typical and

4.6 Proofs

atypical parts of this term separately.

Lemma 16 (Bound for atypical terms).Suppose QM

P,W (XM×YM\Tε)≤aand δ∈[0,1].

Then

PC(︁R

ˆ2,W,C(YM)> a(1 + δ))︁≤exp (︃−1

3δ2aexp(MR))︃.

Proof. Observe EC(R

ˆ2,W,C(YM)) = QM

P,W (XM×YM\Tε)≤aand bound

PC(︂R

ˆ2,W,C(YM)> a(1 + δ))︂

=PC(︂exp(MR)R

ˆ2,W,C(YM)> a exp(MR)(1 + δ))︂

=PC(︄exp(MR)

∑︂

m=1

WM(︁C(m),{yM: (C(m), yM)/∈Tε})︁> a exp(MR)(1 + δ))︄

≤exp (︃−1

3δ2aexp(MR))︃.

The inequality follows from the Chernoff-Hoeffding bound [DP09, Ex. 1.1] by noting

that we sum probabilities (i.e. values in [0,1]) on the left side, that these probabilities

are independently distributed under PCand that by the hypothesis of the lemma the

expectation of the term on the left is bounded by aexp(MR).

Lemma 17 (Bound for typical terms).Let δ, λ > 0and define

r:= exp(M(R−I(X;Y)−ε)).(4.47)

Suppose r/(6λ)≥1. Then

PC(︄ERM

P,W [︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+

> δ)︄

≤(︄1 + √︃3π

2exp (︃3λ2

4r)︃λ

√r+ exp(−λ))︄exp(−δλ),(4.48)

where π≈3.14159 is the area of the unit circle.

Before we prove this lemma, we make an observation that we need in the proof.

Lemma 18. Let Ξbe a measurable function mapping codebooks and elements of YMto

the nonnegative reals and let λ, δ > 0. Then

PC(︁EYMΞ(C, Y M)> δ)︁≤EYMEC(︂exp(λΞ(C, Y M)))︂exp(−δλ).(4.49)

4 Security in Over-the-Air Computation

Proof. An application of the Chernoff bound yields

PC(︁EYMΞ(C, Y M)> δ)︁≤EC(︁exp (︁λEYMΞ(C, Y M))︁)︁exp(−δλ).

We can then prove (4.49) by successive applications of Jensen’s inequality and Fubini’s

theorem.

Proof of Lemma 17. We begin by examining parts of the term in (4.48) for fixed, but

arbitrary Cand yMand rewrite

rdR

ˆ1,W,C

dRM

P,W

(yM) =

exp(MR)

∑︂

m=1

exp (M(−I(X;Y)−ε)) dWM(C(m),·)

dRM

P,W

(yM)1(C(m),yM)∈Tε.

Now, we observe that the indicator function bounds the relative density to be at most

exp(M(I(X;Y) + ε)) and thus every term in the sum to range within [0,1] and that

furthermore

EC(︄rdR

ˆ1,W,C

dRM

P,W

(yM))︄≤exp (M(−I(X;Y)−ε))

exp(MR)

∑︂

m=1

EC(︄dWM(C(m),·)

dRM

P,W

(yM))︄=r.

We then use these observations to yield, for any ξ > 0,

PC(︄exp (︄λ[︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+)︄>exp(λξ))︄=PC(︄dR

ˆ1,W,C

dRM

P,W

(yM)>1 + ξ)︄(4.50)

=PC(︃rdR

ˆ1,W,C

dRM

P,W

(yM)>(1 + ξ)r)︃

≤exp ⎛

⎝−ξ2

2(︂1 + ξ

3)︂r⎞

⎠,(4.51)

where (4.50) holds because the two measured events are equal and (4.51) follows by the

Chernoff-Hoeffding bound [McD98, Theorem 2.3b]. (4.51) can be upper bounded by

exp (︃−ξ2

3r)︃(4.52)

for ξ≤1 (in particular) and by

exp (︃−ξ

3r)︃(4.53)

4.6 Proofs

for ξ≥1 (in particular). We will in the following use the substitutions

a:= exp(λξ) (4.54)

b:= log(a)

λ√︃2r

3−√︃3

2rλ. (4.55)

Since we will be using (4.55) for integration by substitution, we note that it implies

dba= exp (︄bλ√︃3

2r+λ23

2r)︄λ√︃3

2r.(4.56)

We have, e.g. by [Bil95, Eq. 21.9],

EC(︄exp (︄λ[︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+)︄)︄=∫︂∞

PC(︄exp (︄λ[︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+)︄> a)︄da

and upper bound this integral by splitting the integration domain into three parts: The

integration over [0,1] can be upper bounded by 1 (since the integrand is a probability).

The integration over [1,exp(λ)] can be upper bounded as

∫︂exp(λ)

PC(︄exp (︄λ[︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+)︄> a)︄da

≤∫︂∞

exp (︃−(log a)2

3λ2r)︃da (4.57)

=∫︂∞

exp (︄−b2λ23

2r+ 2bλ3(︁3

2r)︁3

2+λ4(︁3

2r)︁2

3λ2r+bλ√︃3

2r+λ23

2r)︄λ√︃3

2rdb (4.58)

=∫︂∞

exp (︃−b2

2)︃db ·exp (︃3λ2

4r)︃λ√︃3

2r.(4.59)

(4.57) follows by substituting (4.52) as well as (4.54) and enlarging the integration domain

to [1,∞), which can be done because the integrand is nonnegative. (4.58) follows by the

rule for integration by substitution using (4.55).

4 Security in Over-the-Air Computation

The integration over [exp(λ),∞) can be upper bounded as

∫︂∞

exp(λ)

PC(︄exp (︄λ[︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+)︄> a)︄da ≤∫︂∞

exp(λ)

exp (︃−log a

3λr)︃da (4.60)

=∫︂∞

exp(λ)

a−r/(3λ)da

=exp(λ(1 −r/(3λ)))

r/(3λ)−1

≤exp(−λ),(4.61)

where (4.60) is by (4.53) and (4.61) is true because r/(6λ)≥1. We now apply Lemma 18

with Ξ(C, yM) := [︃dR

ˆ1,W,C

dRM

P,W

(yM)−1]︃+

. In the resulting bound, we substitute the bound of

1 for integration domain [0,1] as well as (4.59) and (4.61), substitute back (4.47) and note

that exp(−b2/2) is the well-known unnormalized standard normal density, and get (4.48).

Proof of Theorem 6. In order to bound the atypical term in the sum (4.46), note first that

for any α > 1,

P,W (XM×YM\Tε)

=QM

P,W (︁{︁(xM, yM) : i(xM, yM)/M > I(X;Y) + ε}︁)︁

=QM

P,W (︁{(xM, yM) : exp (︁(α−1)i(xM, yM))︁>exp ((α−1)M(I(X;Y) + ε))})︁

≤∫︂XM×YM

exp (︁(α−1)i(xM, yM))︁QP,W (d(xM, yM)) ·exp (−(α−1)M(I(X;Y) + ε))

(4.62)

= exp log (︄∫︂XM×YM(︄dW M(C(m),·)

dRM

P,W

(yM))︄α−1

·QP,W (d(xM, yM)))︄

·exp (−M(α−1) (I(X;Y) + ε))

= exp (−M(α−1) (I(X;Y) + ε−Dα(QP,W ||PRP,W ))) (4.63)

≤exp(−Mβ1),(4.64)

where (4.62) follows by applying Markov’s inequality and (4.64) as long as

β1≤(α−1) (I(X;Y) + ε−Dα(QP,W ||PRP,W )) .(4.65)

Note that since the moment-generating function EQP,W exp(a·i(X, Y )) exists and is

finite for some a > 0, there is some α′>1 such that Dα′(QP,W ||PRP,W ) is finite,

100

4.6 Proofs

and thus Dα(QP,W ||PRP,W ) is finite and continuous in αfor α≤α′[vEH14]. Since

Dα(QP,W ||PRP,W )→I(X;Y) for α→1, we can choose α > 1, but sufficiently close to 1

such that the bound on β1is positive.

We can now apply Lemma 16 with a:= exp(−Mβ1) and δ:= 1 and get

PC(︁R

ˆ2,W,C(YM)>2 exp(−Mβ1))︁≤exp (︃−1

3exp(M(R−β1)))︃.(4.66)

To bound the typical term in (4.46), we apply Lemma 17 with λ:= exp(Mβ2) and

δ:= exp(−Mβ1), which yields

PC(︄ERM

P,W [︄dR

ˆ1,W,C

dRM

P,W

(yM)−1]︄+

>exp(−Mβ1))︄

≤(︄1 + √︃3π

2exp (︃3

4exp(−M(R−I(X;Y)−ε−2β2)) −1

2M(R−I(X;Y)−ε−2β2))︃

+ exp(−exp(Mβ2)))︄exp (−exp(M(β2−β1))) (4.67)

as long as Mis sufficiently large such that exp(M(R−I(X;Y)−ε))/6≥1.

We are now ready to put everything together: Considering (4.46), (4.66) and (4.67), an

application of the union bound yields the sum of (4.66) and (4.67) as an upper bound for

PC(︂∥R

ˆW,C−RM

P,W ∥TV >3 exp(−Mβ1))︂.

We choose ε < R − I(X;Y), then β1<(R − I(X;Y)−ε)/2 small enough to satisfy

(4.65), then β2such that β1< β2<(R−I(X;Y)−ε)/2, and finally we choose γ1< β1

and γ2<min(R−β1, β2−β1). With these choices, we get (4.20) for all sufficiently large

M, thereby concluding the proof.

4.6.5 Cost Constraint in Compound Channel Coding and Resolvability

In this section, we prove Corollaries 7 and 8 which essentially state that the conclusions of

Theorems 5 and 6 are also true for the case of cost-constrained codebooks. The approach

used is similar to the one in [EGK11, Section 3.3], but we include the adapted derivations

in full here for the sake of self-containedness. We begin with a series of preliminary lemmas

and conclude the section with the proofs of the corollaries.

Lemma 19. Let (Ui)i≥1be a sequence of i.i.d. random variables such that the moment

generating function φ(λ) := Eexp(λU1)exists on an interval containing 0in its interior.

101

4 Security in Over-the-Air Computation

Let C>EU1. Then there exists γ > 0such that

P(︄n

∑︂

i=1 Ui> nC)︄≤exp(−nγ).

Proof. We can without loss of generality assume that C= 0 and E(U1)<0, because

otherwise we could consider the random variables (Ui−C)i≥1instead.

Clearly, φ(0) = 1 and φ′(0) = E(U1)<0, so we can find some λ > 0 sufficiently small

such that φ(λ)<1. With this choice of λ, we can apply Markov’s inequality and get

P(︄n

∑︂

i=1 Ui>0)︄=P(︄exp (︄λ

∑︂

i=1 Ui)︄>1)︄

≤E(︄exp (︄λ

∑︂

i=1 Ui)︄)︄

=φ(λ)n

so the lemma follows by choosing γ:= −log φ(λ).

Lemma 20. Let Nbe a Bernoulli random variable with exp(MR)trials and success

probability p≤exp(−Mβ1)where β1<R/2. Then there are γ1, γ2>0such that for

sufficiently large M,

P(N>exp(M(R−γ1))) ≤exp(−exp(Mγ2)).(4.68)

Proof. We choose γ1,γ2and β2such that 0 < γ1< β1< β2<R/2 and γ2<R−2β2.

Then

P(︁N>exp(M(R−γ1)))︁

=P(︁N> p exp(MR) + (exp(−Mγ1)−p) exp(MR))︁

≤P(︁N>EN+ (exp(−Mγ1)−exp(−Mβ1)) exp(MR))︁

≤P(︁N>EN+ exp(−Mβ2) exp(MR))︁(4.69)

≤exp (︄−2(︁exp(−Mβ2))︁2(︁exp(MR))︁2

exp(MR))︄(4.70)

= exp (−2 exp(M(R−2β2))) (4.71)

≤exp(−exp(Mγ2)),(4.72)

where (4.70) follows by the Chernoff-Hoeffding bound as stated for instance in [DP09,

102

4.6 Proofs

Theorem 1.1, eq. (1.6)].

Lemma 21. Let Pbe a probability distribution on X. Assume moreover that c(X)has

a moment generating function defined in an interval with 0in its interior and that C>

EPc(X). Denote the number of bad codewords in Cwith

N:=

exp(MR)

∑︂

m=1

1∑︁M

m=1 c(C(m)(m))>MC.

Then there are γ1, γ2>0such that

PC(N>exp(M(R−γ1))) ≤exp(−exp(Mγ2)).(4.73)

Proof. Since the codeword components are i.i.d., we can apply Lemma 19 and obtain an

arbitrarily small β1>0 such that for all m,

p:= PC(︄M

∑︂

m=1

c(C(m)(m)) > MC)︄≤exp(−Mβ1).

So since the codewords are independent, Nis a Bernoulli variable with exp(MR) trials

and success probability p, and an application of Lemma 20 proves the conclusion.

Proof of Corollary 7. Assume throughout the proof that Mis sufficiently large. By Lemma 21,

we have γ1

ˆ, γ2

ˆ∈(0,∞) with

PC(E

ˆ)≤exp(−exp(Mγ2

ˆ )),(4.74)

where

ˆ:= {PM(C(M)=Cc,C(M)) >exp(−Mγ1

ˆ )}.

We denote the error of Cwith δCand the error of Cc,Cwith δCc,C. By Theorem 5 and

Markov’s inequality, we have, for some γˆ∈(0,∞) given by the theorem and with choices

γ1

˜∈(0,min(γˆ, γ1

ˆ )), γ2

˜∈(0, γˆ−γ1

˜ ),

PC(δC≥exp(−Mγ1

˜ )) ≤ECδCexp(Mγ1

˜ )

≤exp(−M(γˆ−γ1

˜ ))

≤exp(−Mγ2

˜ ).(4.75)

103

4 Security in Over-the-Air Computation

Conditioned on the complement of E

ˆ, we have

δCc,C

(a)

= sup

s∈S

EM(︂Ps(︁M=DM(YM)|XM=Cc,C(M))︁)︂

= sup

s∈S

exp(MR)

∑︂

m=1

exp(−MR)Ps(︁m=DM(YM)|XM=Cc,C(m))︁

(b)

≤sup

s∈S

exp(MR)

∑︂

m=1

Cc,C(m)=C(m)

exp(−MR)Ps(︁m=DM(YM)|XM=Cc,C(m))︁+

exp(MR)

∑︂

m=1

Cc,C(m)=C(m)

exp(−MR)

(a)

≤δC+ exp(−Mγ1

ˆ ),(4.76)

where the steps marked with (a) are by the definition of compound coding error, and (b)

is by upper bounding some of the probabilities in the sum with 1. We can now choose

γ1∈(0, γ1

˜ ) and obtain

PC(︁δCc,C≥exp(−Mγ1))︁(a)

≤PC(︁δCc,C≥exp(−Mγ1)|¬E

ˆ)︁+PC(E

ˆ)

(4.76)

≤PC(︁δC+ exp(−Mγ1

ˆ ) ≥exp(−Mγ1)|¬E

ˆ)︁+PC(E

ˆ)

(a)

≤PC(︁δC≥exp(−Mγ1)−exp(−Mγ1

ˆ ))︁

1−PC(E

ˆ)+PC(E

ˆ)

(b)

≤PC(︁δC≥exp(−Mγ1

˜ ))︁

1−PC(E

ˆ)+PC(E

ˆ)

(4.74),(4.75)

≤exp(−Mγ2

˜ )

1−exp(−exp(Mγ2

ˆ )) + exp(−exp(Mγ2

ˆ ))

(c)

≤exp(−Mγ2),

where the steps marked with (a) are by the law of total probability, step (b) is by the

choices of γ1, γ1

˜ , and step (c) is valid for any choice of γ2∈(0, γ2

˜ ).

Proof of Corollary 8. By Lemma 21, we pick γ1

ˆ, γ2

ˆ satisfying (4.73) and by Theorem 6,

we pick γ1

˜, γ2

˜ satisfying (4.20).

We use the observation that N≤exp((R−γ1

ˆ )M) implies

∥R

ˆCc,C−R

ˆC∥TV ≤N

exp(MR)≤exp(−γ1

ˆM) (4.77)

104

4.6 Proofs

and observe that, as long as γ1< γ1

ˆ, γ1

˜ and γ2< γ2

ˆ, γ2

˜ and Mis sufficiently large,

PCc,C(︂∥R

ˆCc,C−QM

P,W ∥TV >exp(−γ1M))︂

(a)

≤PC(︂∥R

ˆCc,C−R

ˆC∥TV +∥R

ˆC−QM

P,W ∥TV >exp(−γ1M))︂

(b)

≤PC(︂∥R

ˆCc,C−R

ˆC∥TV >exp(−γ1

ˆM))︂+PC(︂∥R

ˆC−QM

P,W ∥TV >exp(−γ1

˜M))︂

(c)

<exp(−exp(γ2

ˆM)) + exp(−exp(γ2

˜M))

(d)

≤exp(−exp(γ2M)),

where (a) is by the triangle inequality, (b) is by the union bound and the choice of γ1, (c)

is due to (4.73), (4.77) and (4.20), and (d) is by the choice of γ2.

105

Publication List

[1] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Resolvability on continuous

alphabets. In 2018 IEEE International Symposium on Information Theory (ISIT),

pages 2037–2041. IEEE, 2018.

[2] Igor Bjelakovi´c, Matthias Frey, and Slawomir Sta´nczak. Distributed approximation

of functions over fast fading channels with applications to distributed learning and

the max-consensus problem. In 2019 57th Annual Allerton Conference on Commu-

nication, Control, and Computing (Allerton), pages 1146–1153, Sep. 2019.

[3] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Over-the-air computation

in correlated channels. In 2020 IEEE Information Theory Workshop (ITW). IEEE,

2021.

[4] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Over-the-air computation in

correlated channels. IEEE Transactions on Signal Processing, 69:5739–5755, 2021.

[5] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Towards secure over-the-air

computation. In 2021 IEEE International Symposium on Information Theory (ISIT),

pages 700–705. IEEE, 2021.

[6] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Towards secure over-the-air

computation. Submitted to IEEE Transactions on Information Theory, 2022. Preprint

available at arXiv:2001.03174.

[7] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Over-the-air computation

for distributed machine learning and consensus in large wireless networks. In Gitta

Kutyniok, Holger Rauhut, and Robert J. Kunsch, editors, Compressed Sensing in

Information Processing. Springer, 2021. To appear.

[8] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. MAC resolvability: First

and second order results. In 2017 IEEE Conference on Communications and Network

Security (CNS), pages 560–564. IEEE, 2017.

107

Publication List

[9] Navneet Agrawal, Matthias Frey, and Slawomir Sta´nczak. A scalable max-consensus

protocol for noisy ultra-dense networks. In 2019 IEEE 20th International Workshop

on Signal Processing Advances in Wireless Communications (SPAWC), pages 1–5.

IEEE, 2019.

[10] Zoran Utkovski, Patrick Agostini, Matthias Frey, Igor Bjelakovi´c, and Slawomir

Sta´nczak. Learning radio maps for physical-layer security in the radio access. In

2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless

Communications (SPAWC), pages 1–5. IEEE, 2019.

[11] Gutierrez-Estevez, Miguel A., Zoran Utkovski, Patrick Agostini, Daniel Sch¨aufele,

Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Quality-of-service prediction

for physical-layer security via secrecy maps. In 2020 IEEE International Conference

on Acoustics, Speech and Signal Processing (ICASSP), pages 2867–2871. IEEE, 2020.

[12] Zoran Utkovski, Matthias Frey, Patrick Agostini, Igor Bjelakovi´c, and Slawomir

Sta´nczak. Semantic security based secrecy maps for vehicular communications. In

25th International ITG Workshop on Smart Antennas (WSA), pages 307–311. ITG,

2021.

[13] Matthias Frey, Igor Bjelakovi´c, Janis N¨otzel, and Slawomir Sta´nczak. Semantic secu-

rity with infinite dimensional quantum eavesdropping channel. In Submitted to 2022

IEEE Information Theory Workshop (ITW). IEEE, 2022.

108

Bibliography

[ADG19] M. M. Amiri, T. M. Duman, and D. G¨und¨uz. Collaborative machine learning

at the wireless edge with blind transmitters. In 2019 IEEE Global Conference

on Signal and Information Processing (GlobalSIP). IEEE, 2019.

[AG19] M. M. Amiri and D. G¨und¨uz. Computation scheduling for distributed ma-

chine learning with straggling workers. IEEE Transactions on Signal Pro-

cessing, 67(24):6270–6284, 2019.

[AG20a] M. M. Amiri and D. G¨und¨uz. Federated learning over wireless fading chan-

nels. IEEE Transactions on Wireless Communications, 19(5):3546–3557,

2020.

[AG20b] M. M. Amiri and D. G¨und¨uz. Machine learning at the wireless edge: Dis-

tributed stochastic gradient descent over-the-air. IEEE Transactions on Sig-

nal Processing, 68:2155–2169, 2020.

[Ahl67] R. Ahlswede. Certain results in coding theory for compound channels. In

Proceedings of the Colloquium on Information Theory, Debrecen, Hungary,

pages 35–60, 1967.

[AOGE20] M. S. H. Abad, E. Ozfatura, D. G¨und¨uz, and O. Ercetin. Hierarchical feder-

ated learning across heterogeneous cellular networks. In ICASSP 2020-2020

IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP), pages 8866–8870. IEEE, 2020.

[Arn57] V. Arnold. On functions of three variables. Doklady Akademii Nauk SSSR,

114:679–681, 1957.

[ASK19] J.-H. Ahn, O. Simeone, and J. Kang. Wireless federated distillation for dis-

tributed edge learning with heterogeneous data. In 2019 IEEE 30th Annual

International Symposium on Personal, Indoor and Mobile Radio Communi-

cations (PIMRC). IEEE, 2019.

[BBT59] D. Blackwell, L. Breiman, and A. J. Thomasian. The capacity of a class of

channels. The Annals of Mathematical Statistics, 30(4):1229–1241, 1959.

109

Bibliography

[BCD89] R. Bhatia, M.-D. Choi, and C. Davis. Comparing a matrix to its off-diagonal

part. In H. Dym, S. Goldberg, M. A. Kaashoek, and P. Lancaster, editors,

The Gohberg Anniversary Collection: Volume I: The Calgary Conference and

Matrix Theory Papers, pages 151–164. Birkh¨auser, Basel, Switzerland, 1989.

[BCH08] L. Brunet, H.-L. Choi, and J. How. Consensus-based auction approaches for

decentralized task assignment. In AIAA guidance, navigation and control

conference and exhibit, 2008.

[Bha97] R. Bhatia. Matrix analysis, volume 169 of Graduate Texts in Mathematics.

Springer, New York, New York, United States of America, 1997.

[Bil95] P. Billingsley. Probability and Measure. Wiley Series in Probability and Math-

ematical Statistics. Wiley, New York, New York, United States of America,

3rd edition, 1995.

[BK00] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Vari-

ables and Random Processes. Cross Cultural Communication. American

Mathematical Society, Providence, Rhode Island, United States of America,

2000.

[BL13] M. R. Bloch and J. N. Laneman. Strong secrecy from channel resolvability.

Transactions on Information Theory, 59(12):8077–8098, 2013.

[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. A

Nonasymptoitic Theory of Independence. Oxford University Press, Oxford,

United Kingdom, 2013.

[BRB93] K. L. Blackard, T. S. Rappaport, and C. W. Bostian. Measurements and

models of radio frequency impulsive noise for indoor wireless communications.

IEEE Journal on Selected Areas in Communications, 11(7):991–1001, 1993.

[BS92] J. A. Benediktsson and P. H. Swain. Consensus theoretic classification meth-

ods. IEEE Transactions on Systems, Man, and Cybernetics, 22(4):688–704,

1992.

[BTV12] M. Bellare, S. Tessaro, and A. Vardy. Semantic security for the wiretap chan-

nel. In Advances in Cryptology – CRYPTO 2012, pages 294–311. Springer,

Berlin and Heidelberg, Germany, 2012.

[Buc76] R. C. Buck. Approximate complexity and functional representation. Techni-

cal Report 1656, Wisconsin University Madison Mathematics Research Cen-

ter, 1976.

110

Bibliography

[Buc82] R. C. Buck. Nomographic functions are nowhere dense. Proceedings of the

American Mathematical Society, 85(2):195–199, 1982.

[CH12] A. Christmann and R. Hable. Consistency of support vector machines us-

ing additive kernels for additive models. Computational Statistics & Data

Analysis, 56(4):854–873, 2012.

[CK11] I. Csisz´ar and J. K¨orner. Information theory: Coding Theorems for Dis-

crete Memoryless Systems. Cambridge University Press, Cambridge, United

Kingdom, 2011.

[Csi96] I. Csisz´ar. Almost independence and secrecy capacity. Problems of Informa-

tion Transmission, 32(1):40–47, 1996.

[CT20] W.-T. Chang and R. Tandon. Communication efficient federated learning

over multiple access channels. arXiv preprint arXiv:2001.08737, 2020.

[Cuf16] P. Cuff. Soft covering with high probability. In 2016 IEEE International

Symposium on Information Theory (ISIT), pages 2963–2967. IEEE, 2016.

[Dev05] I. Devetak. The private classical capacity and quantum capacity of a quantum

channel. Transactions on Information Theory, 51(1):44–55, 2005.

[Dob59] R. L. Dobrushin. Optimum information transmission through a channel with

unknown parameters. Radio Engineering and Electronic Physics, 4(12):1–8,

1959.

[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis

of Randomized Algorithms. Cambridge University Press, Cambridge, United

Kingdom, 2009.

[DSD20] J. Dong, Y. Shi, and Z. Ding. Blind over-the-air computation and data

fusion via provable wirtinger flow. IEEE Transactions on Signal Processing,

68:1136–1151, 2020.

[EG59] E. Eisenberg and D. Gale. Consensus of subjective probabilities: The pari-

mutuel method. The Annals of Mathematical Statistics, 30(1):165–168, 1959.

[EGK11] A. El Gamal and Y.-H. Kim. Network Information Theory. Cambridge

University Press, Cambridge, United Kingdom, 2011.

[FLNS10] H. Ferreira, L. Lampe, J. Newbury, and T. G. Swart, editors. Power Line

Communications. Theory and Applications for Narrowband and Broadband

111

Bibliography

Communications over Power Lines. Wiley, Chichester, United Kingdom,

2010.

[Fre85] S. French. Group consensus probability distributions: a critical survey. In

J. M. Bemado, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, edi-

tors, Bayesian Statistics 2: Proceedings of the Second Valencia International

Meeting. Elsevier and Valencia University Press, Amsterdam, The Nether-

lands and Valencia, Spain, 1985.

[GBS13] M. Goldenbaum, H. Boche, and S. Sta´nczak. Harnessing interference for

analog function computation in wireless sensor networks. IEEE Transactions

on Signal Processing, 61(20):4893–4906, 2013.

[GBS14] M. Goldenbaum, H. Boche, and S. Sta´nczak. Nomographic functions: Effi-

cient computation in clustered gaussian sensor networks. IEEE Transactions

on Wireless Communications, 14(4):2093–2105, 2014.

[GdKS+19] D. G¨und¨uz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and

M. van der Schaar. Machine learning in the air. IEEE Journal on Selected

Areas in Communications, 37(10):2184–2199, 2019.

[Gil11] M. Gil. On R´enyi divergence measures for continuous alphabet sources. Mas-

ter’s thesis, Queen’s University Kingston, Ontario, Canada, 2011.

[GJRM+16] M. Goldenbaum, P. Jung, M. Raceala-Motoc, J. Schreck, S. Sta´nczak, and

C. Zhou. Harnessing channel collisions for efficient massive access in 5G

networks: A step forward to practical implementation. In 2016 9th Inter-

national Symposium on Turbo Codes and Iterative Information Processing

(ISTC), pages 335–339. IEEE, 2016.

[GS13] M. Goldenbaum and S. Sta´nczak. Robust analog function computation via

wireless multiple-access channels. IEEE Transactions on Communications,

61(9):3863–3877, 2013.

[GS14] M. Goldenbaum and S. Sta´nczak. On the channel estimation effort for analog

computation over wireless multiple-access channels. IEEE Wireless Commu-

nications Letters, 3(3):261–264, 2014.

[GV03] M. Gastpar and M. Vetterli. Source-channel communication in sensor net-

works. In F. Zhao and L. Guibas, editors, Information Processing in Sensor

Networks, pages 162–177, Berlin and Heidelberg, Germany, 2003. Springer.

112

Bibliography

[Hay06] M. Hayashi. General nonasymptotic and asymptotic formulas in channel

resolvability and identification capacity and their application to the wiretap

channel. Transactions on Information Theory, 52(4):1562–1575, 2006.

[Hil00] D. Hilbert. Mathematische Probleme. Vortrag, gehalten auf dem interna-

tionalen Mathematiker-Kongreß zu Paris 1900. Nachrichten von der K¨onigl.

Gesellschaft der Wissenschaften zu G¨ottingen, pages 253–297, 1900.

[Hil27] D. Hilbert. ¨

Uber die Gleichung neunten Grades. Mathematische Annalen,

97(1):243–250, 1927.

[HK14] J. Hou and G. Kramer. Effective secrecy: Reliability, confusion and stealth.

In 2014 IEEE International Symposium on Information Theory, pages 601–

605. IEEE, 2014.

[HM16] M. Hayashi and R. Matsumoto. Secure multiplex coding with dependent and

non-uniform multiple messages. IEEE Transactions on Information Theory,

62(5):2355–2409, 2016.

[HV93] T. S. Han and S. Verd´u. Approximation theory of output statistics. Trans-

actions on Information Theory, 39(3):752–772, 1993.

[ICJ12] F. Iutzeler, P. Ciblat, and J. Jakubowicz. Analysis of max-consensus al-

gorithms in wireless channels. IEEE Transactions on Signal Processing,

60(11):6103–6107, 2012.

[Jay03] E. Jaynes. Probability Theory: The Logic of Science. Cambridge University

Press, Cambridge, United Kingdom, 2003.

[JKB94] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distri-

butions, volume 1 of Wiley Series in Probability and Mathematical Statistics.

Wiley, New York, New York, United States of America, 2nd edition, 1994.

[Kes61] H. Kesten. Some remarks on the capacity of compound channels in the

semicontinuous case. Information and Control, 4(2-3):169–184, 1961.

[KM79] J. Korner and K. Marton. How to encode the modulo-two sum of binary

sources (corresp.). IEEE Transactions on Information Theory, 25(2):219–

221, 1979.

[KMRR16] J. Koneˇcn`y, H. B. McMahan, D. Ramage, and P. Richt´arik. Federated op-

timization: Distributed machine learning for on-device intelligence. arXiv

preprint arXiv:1610.02527, 2016.

113

Bibliography

[Kol57] A. N. Kolmogorov. On the representation of continuous functions of sev-

eral variables by superposition of continuous functions of one variable and

addition. Doklady Akademii Nauk SSSR, 114:953–956, 1957.

[LZLV20] W. Liu, X. Zang, Y. Li, and B. Vucetic. Over-the-air computation systems:

Optimization, analysis and scaling laws. IEEE Transactions on Wireless

Communications, 19(8):5488–5502, 2020.

[MASR21] F. Molinari, N. Agrawal, S. Sta´nczak, and J. Raisch. Max-consensus over

fading wireless channels. IEEE Transactions on Control of Network Systems,

8(2):791–802, 2021.

[Mau94] U. M. Maurer. The strong secret key rate of discrete random triples. In

R. E. Blahut, D. J. Costello, U. Maurer, and T. Mittelholzer, editors, Com-

munications and Cryptography: Two Sides of One Tapestry, pages 271–285.

Springer, Boston, Massachusetts, United States of America, 1994.

[McD98] C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-

Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Dis-

crete Mathematics, pages 195–248. Springer, Berlin and Heidelberg, Ger-

many, 1998.

[MDR19] F. Molinari, A. M. Dethof, and J. Raisch. Traffic automation in urban road

networks using consensus-based auction algorithms for road intersections.

In 2019 18th European Control Conference (ECC), pages 3008–3015. IEEE,

2019.

[Mid99] D. Middleton. Non-gaussian noise models in signal processing for telecom-

munications: new methods an results for class A and class B noise models.

IEEE Transactions on Information Theory, 45(4):1129–1149, 1999.

[MMR+17] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas.

Communication-efficient learning of deep networks from decentralized data.

In Proceedings of the 20 th International Conference on Artificial Intelligence

and Statistics (AISTATS) 2017, pages 1273–1282, 2017.

[MNT07] G. Mergen, V. Naware, and L. Tong. Asymptotic detection performance of

type-based multiple access over multiaccess fading channels. IEEE Transac-

tions on Signal Processing, 55(3):1081–1092, 2007.

[MR17] B. McMahan and D. Ramage. Federated learning: Collabora-

tive machine learning without centralized training data, 2017.

114

Bibliography

Google AI Blog. Available at https://ai.googleblog.com/2017/04/

federated-learning-collaborative.html, retrieved 02 March 2021.

[MRT12] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine

Learning. Adaptive Computation and Machine Learning. MIT Press, Cam-

bridge, Massachusetts, United States of America and London, United King-

dom, 2012.

[MS93] D. Middleton and A. D. Spaulding. Elements of weak signal detection in

non-gaussian noise environments. In V. Poor and J. B. Thomas, editors,

Advances in Statistical Signal Processing, volume 2, pages 137–215. JAI Press,

Greenwich, Connecticut, United States of America, 1993.

[MSR18] F. Molinari, S. Sta´nczak, and J. Raisch. Exploiting the superposition prop-

erty of wireless communication for average consensus problems in multi-agent

systems. In 2018 European Control Conference (ECC), pages 1766–1772.

IEEE, 2018.

[MT06] G. Mergen and L. Tong. Type based estimation over multiaccess channels.

IEEE Transactions on Signal Processing, 54(2):613–626, 2006.

[NCNC16] B. Nazer, V. R. Cadambe, V. Ntranos, and G. Caire. Expanding the compute-

and-forward framework: Unequal powers, signal levels, and multiple linear

combinations. IEEE Transactions on Information Theory, 62(9):4879–4909,

2016.

[NG05] R. Negi and S. Goel. Secret communication using artificial noise. In IEEE

62nd Vehicular Technology Conference (VTC), volume 3, pages 1906–1910.

IEEE, 2005.

[NG07] B. Nazer and M. Gastpar. Computation over multiple-access channels. IEEE

Transactions on Information Theory, 53(10):3498–3516, 2007.

[NG11] B. Nazer and M. Gastpar. Compute-and-forward: Harnessing interfer-

ence through structured codes. IEEE Transactions on Information Theory,

57(10):6463–6486, 2011.

[OS06] R. Olfati-Saber. Flocking for multi-agent dynamic systems: Algorithms and

theory. IEEE Transactions on Automatic Control, 51(3):401–420, 2006.

[OSFM07] R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus and cooperation

in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233,

2007.

115

Bibliography

[OUG19] E. Ozfatura, S. Ulukus, and D. G¨und¨uz. Distributed gradient descent with

coded partial gradient computations. In 2019 IEEE International Confer-

ence on Acoustics, Speech and Signal Processing (ICASSP), pages 3492–3496.

IEEE, 2019.

[OZE+11] O. Ordentlich, J. Zhan, U. Erez, M. Gastpar, and B. Nazer. Practical code

design for compute-and-forward. In 2011 IEEE International Symposium on

Information Theory (ISIT), pages 1876–1880. IEEE, 2011.

[P+11] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of

Machine Learning Research, 12:2825–2830, 2011.

[RGS16] K. Ralinovski, M. Goldenbaum, and S. Sta´nczak. Energy-efficient classifica-

tion for anomaly detection: The wireless channel as a helper. In 2016 IEEE

International Conference on Communications (ICC), 2016.

[RV68] W. L. Root and P. P. Varaiya. Capacity of classes of gaussian channels. SIAM

Journal on Applied Mathematics, 16(6):1350–1393, 1968.

[SC08] I. Steinwart and A. Christmann. Support Vector Machines. Information Sci-

ence and Statistics. Springer, New York, New York, United States of America,

2008.

[SC20] T. Sery and K. Cohen. On analog gradient descent learning over multiple

access fading channels. IEEE Transactions on Signal Processing, 68:2897–

2911, 2020.

[Sha49] C. E. Shannon. Communication theory of secrecy systems. Bell Labs Tech-

nical Journal, 28(4):656–715, 1949.

[STL20] M. Seif, R. Tandon, and M. Li. Wireless federated learning with local dif-

ferential privacy. In 2020 IEEE International Symposium on Information

Theory (ISIT), pages 2604–2609, 2020.

[SY12] I. Stanojev and A. Yener. Improving secrecy rate via spectrum leasing

for friendly jamming. IEEE Transactions on Wireless Communications,

12(1):134–145, 2012.

[SZG20] Y. Sun, S. Zhou, and D. G¨und¨uz. Energy-aware analog aggregation for feder-

ated learning with redundant data. In ICC 2020 - 2020 IEEE International

Conference on Communications (ICC), 2020.

116

Bibliography

[VBBM10] J. P. Vilela, M. Bloch, J. Barros, and S. W. McLaughlin. Friendly jamming for

wireless secrecy. In 2010 IEEE International Conference on Communications

(ICC). IEEE, 2010.

[VBBM11] J. P. Vilela, M. Bloch, J. Barros, and S. W. McLaughlin. Wireless secrecy

regions with friendly jamming. IEEE Transactions on Information Forensics

and Security, 6(2):256–266, 2011.

[vEH14] T. van Erven and P. Harremos. R´enyi divergence and Kullback-Leibler diver-

gence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

[Ver18] R. Vershynin. High Dimensional Probability: An Introduction with Appli-

cations in Data Science, volume 47 of Cambridge Series in Statistical and

Probabilistic Mathematics. Cambridge University Press, Cambridge, United

Kingdom, 2018.

[Wai19] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic View-

point. Cambridge Series in Statistical and Probabilistic Mathematics. Cam-

bridge University Press, Cambridge, United Kingdom, 2019.

[Win68] R. L. Winkler. The consensus of subjective probability distributions. Man-

agement Science, 15(2):B–61, 1968.

[Wol59] J. Wolfowitz. Simultaneous channels. Archive for Rational Mechanics and

Analysis, 4(1):371–386, 1959.

[Wyn75a] A. Wyner. The common information of two dependent random variables.

Transactions on Information Theory, 21(2):163–179, 1975.

[Wyn75b] A. D. Wyner. The wire-tap channel. Bell Labs Technical Journal, 54(8):1355–

1387, 1975.

[YC16] X. Yin and X. Cheng. Propagation Channel Characterization, Parameter

Estimation and Modelling for Wireless Communications. Wiley and IEEE

Press, Singapore, 2016.

[YJSD20] K. Yang, T. Jiang, Y. Shi, and Z. Ding. Federated learning via over-the-air

computation. IEEE Transactions on Wireless Communications, 19(3):2022–

2035, 2020.

[YLCT19] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept

and applications. ACM Transactions on Intelligent Systems and Technology

(TIST), 10(2):1–19, 2019.

117

Bibliography

[Yos65] K. Yoshihara. Coding theorems for the compound semi-continuous memory-

less channels. Kodai Mathematical Seminar Reports, 17(1):30–43, 1965.

[ZCL+19] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang. Edge intelligence:

Paving the last mile of artificial intelligence with edge computing. Proceedings

of the IEEE, 107(8):1738–1762, 2019.

[ZDHL20] Q. Zeng, Y. Du, K. Huang, and K. K. Leung. Energy-efficient radio resource

allocation for federated edge learning. In 2020 IEEE International Conference

on Communications Workshops (ICC Workshops), 2020.

[ZLD+20] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang. Toward an intelligent

edge: Wireless communication meets machine learning. IEEE Communica-

tions Magazine, 58(1):19–25, 2020.

[ZLL+18] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated

learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.

[ZNGE09] J. Zhan, B. Nazer, M. Gastpar, and U. Erez. MIMO compute-and-forward.

In 2009 IEEE International Symposium on Information Theory (ISIT), pages

2848–2852. IEEE, 2009.

[ZWH20] G. Zhu, Y. Wang, and K. Huang. Broadband analog aggregation for low-

latency federated edge learning. IEEE Transactions on Wireless Communi-

cations, 19(1):491–506, 2020.

[ZZ20] A. R. Zhang and Y. Zhou. On the non-asymptotic and sharp lower tail

bounds of random variables. Stat, 9(1):e314, 2020.

118

Notations and Symbols

¯complex conjugate. 37

≪absolute continuity relation of measures. 79

∥·∥FFrobenius norm on matrices. 27

∥·∥op operator norm on matrices. 27

∥·∥TV total variation norm of signed measures. 72

1·indicator function. 56

Afading-generating matrix. 22

A1,...,AKtransmitters. 74

Bnoise-generating matrix. 22

Breceiver. 67

(c,C) additive input cost constraint for a channel. 79

Ccodebook. 79

Cc,Ccost-constrained codebook. 80

Ccomplex numbers. 21

DMpost-processing/decoding operation of the receiver for a scheme with block length

M. 24, 78

Dα(µ||ν) R´enyi divergence of order αbetween measures µand ν. 78

D1(µ||ν) Kullback-Leibler divergence between measures µand ν. 78

kpre-processing/encoding operation of transmitter kfor a scheme with block length

M. 24, 78

Eeavesdropper. 71

119

Notations and Symbols

exp(·) exponential function with basis e. 20

e≈2.71828 Euler’s number. 5, 20

f:S1×. . . ×SK→Robjective function to be approximated. 2, 24

fk:Sk→Rinner function of nomographic representation. 2, 3, 25

F:A→Router function of nomographic representation. 2, 3, 25

˜estimator at the receiver for f(s1, . . . , sK).. 25

FK,lin class of generalized linear functions. 25

Fmon class of functions to be approximated with DFA schemes. 25

Gsub-Gaussian vector with independent entries. 22

geavesdropper’s objective. 71

Hk(m) fading coefficient for transmitter kat channel use m. 21

Hvector of all fading coefficients. 21

Hreproducing kernel Hilbert space. 53

hAB, hAE, hJB, hJE fading coefficients of the channels between transmitter / jammer and

legitimate receiver/eavesdropper. 74

iP,W (xM;yM) information density of input and output tuples of the channel Wunder

input distribution P. 78

IP,W mutual information of the input and output of the channel Wunder input distribu-

tion P. 78

idnidentity matrix of dimension n×n. 30

·Im imaginary part of a complex number. 21

Jjammer. 71

Knumber of transmitters. 6, 21

Lloss function. 11

log(·) natural logarithm with basis e. 4

120

Notations and Symbols

Mcodebook block length / number of times the channel is used in a particular commu-

nication scheme. 7, 21

M(f, ε, δ) communication cost for approximating f. 25

N(m) additive noise at channel use m. 21

Nvector of all noise realizations. 22

N(µ, Σ) multivariate normal distribution with mean µand covariance matrix Σ. 47

(P, M, R)-ensemble random codebook ensemble with input distribution P, block length

Mand rate R. 79

Ppower constraint. 21

PAtransmitter power constraint. 74

PJjammer power constraint. 74

Pchannel input distribution. 77

QP,W joint input-output distribution of channel Wunder input distribution P. 77

Rrate of a codebook. 78

RP,W output distribution of channel Wunder input distribution P. 77

ˆW,Cmarginal distribution of the output of the channel Winduced by the codebook C.

ˆs1,...,sKideal eavesdropper output distribution induced by DFA scheme and jamming

strategy. 71

Rreal numbers. 2

·Re real part of a complex number. 21

Symn

+symmetric, positive semidefinite n×nmatrices with real entries. 84

Symn

++ symmetric, positive definite n×nmatrices with real entries. 84

·Ttranspose of a matrix or vector. 22

Tk(m), Tkchannel input symbol. 21, 89

tr(·) trace of matrices. 39

121

Notations and Symbols

Uk(1), . . . , Uk(M) randomness used in pre-processing. 24

Wchannel. 6, 77

(Ws)s∈S compound channel. 78

ˆη,j)J

j=1 sequence of channels that approximate the compound channel (Ws)s∈S. 79

WBlegitimate user’s effective channel. 80

WEeavesdropper’s effective channel. 80

Xfeature alphabet. 11

Xchannel input alphabet. 78

Ylabel alphabet. 11

Ychannel output alphabet. 78

Zeavesdropper’s channel output alphabet. 71

∆

¯(f) total spread of the inner part of f. 26

∆(f) maximum spread of the inner part of f. 26

∆(f∥P) relative spread of the inner part of fwith respect to power constraint P. 26

θ(X) sub-exponential norm of the random variable X. 43

ϑjamming decoder. 71

κreproducing kernel. 53

π≈3.14159 area of the unit circle. 97

σ2

eff,Blegitimate receiver’s effective noise-to-signal ratio. 75

σ2

eff,Eeavesdropper’s effective noise-to-signal ratio. 75

τ(X) sub-Gaussian norm of the random variable X. 20, 43

Φ increment majorant. 25

ΦNcumulative distribution function of the standard normal distribution. 75

φNprobability density function of the standard normal distribution. 75

122

Abbreviations

AWGN Additive White Gaussian Noise.

CSI channel state information.

DFA Distributed Function Approximation.

FL Federated Learning.

HFL Horizontal Federated Learning.

i.i.d. independent and identically distributed.

IQ inphase-quadrature.

MAC multiple-access channel.

ML Machine Learning.

MSE mean square error.

OTA Over-the-Air.

SVM Support Vector Machine.

TDMA Time Division Multiple Access.

VFL Vertical Federated Learning.

123

Index

ε-approximation of a function, 25

absolute continuity, 79

AdaBoost, 56

block fading channel, 23, 30

boosting, 55

Bounded Differences Inequality, 41

Central Limit Theorem, 23

channel resolvability, 68, 69, 81

communication cost for approximating a

function, 25

compound channel, 78

(η, J)-approximation, 79

compound channel code, 78

compound channel coding, 68, 69, 81

continuous case, 69

finite alphabets, 69, 82

semicontinuous case, 69

computation coding, 6

confidence level, 25

consensus problem, 13

correlated fading, 23

correlated noise, 23

cost constraint, 79

cost-constrained codebook, 80

cross-layer methods, 10

cutoff percentage, 58

decision tree, 58

DFA, see Distributed Function

Approximation

Distributed Function Approximation,

19, 24

with jamming, 71

distributed Machine Learning, 11

distributed optimization, 12

eavesdropper’s objective, 71

effective channel, 80

ensemble of random codebooks, see

random codebook ensemble

equal majority vote, 56

feature alphabet, 11

friendly jamming, 69

Gaussian-to-impulsive power ratio, 58

generalized linear function, 25

Hanson-Wright inequality, 38

HFL, see Horizontal Federated Learning

Horizontal Federated Learning, 12

impulsive index, 58

increment majorant, 26

information, seemutual information78

information density, 77

inner function, 25

innerFunction, 3

interference, 23

125

Index

jamming decoding function, 71

jamming strategy, 71

joint source-channel coding, 6, 10

kernel, 53

Kullback-Leibler divergence, 78

label alphabet, 11

labeling function, 11

limited mobility, 23

loss, 11

Lipschitz-continuous, 52

max-spread, 26

Middleton noise, 58

MSE security, 72

multi-step protocols, 14

mutual information, 78

nomographic function, 2

nomographic representation, 2

non-Gaussian fading, 23

non-Gaussian noise, 23

OTA computation, see Over-the-Air

computation

outer function, 3, 25

Over-the-Air computation, 5

analog, 8

digital, 6

physical layer network coding, 8

physical layer security, 69

power constraint

peak, 21

random codebook ensemble, 79

relative spread, 26

reproducing kernel Hilbert space, 53

risk, 11

R´enyi divergence, 78

semantic security, 69, 71

shared randomness, 77

statistical inference problem, 11

stochastic gradient descent, 12

sub-exponential, 43

sub-Gaussian, 20, 43

support vector machine, 52

SVM, see support vector machine

synchronization error, 62

TDMA, see time division multiple access

thermal noise, 23

time division multiple access, 32, 59

total spread, 26

total variation norm, 72

training procedure, 11

training sample, 11

Type-Based Multiple-Access, 13

user-independent fading, 22

Vertical Federated Learning, 12, 51

VFL, see Vertical Federated Learning

126