Adaptive Parameter Servers
submitted by
Florian Alexander Renz-Wieland, MSc
to Faculty IV Electrical Engineering and Computer Science
of Technische Universität Berlin
for the award of the academic degree
Doctor rerum naturalium
(Dr. rer. nat.)
approved dissertation
Doctoral committee:
Chair: Prof. Dr. Klaus-Robert Müller
Reviewer: Prof. Dr. Volker Markl
Reviewer: Prof. Dr. Rainer Gemulla
Reviewer: Prof. Dr. Matthias Boehm
Reviewer: Prof. Dr. Tilmann Rabl
Date of the scientific defense:
16 December 2022
Berlin 2023
Abstract
Machine learning (ML) has become an essential tool for solving problems that
have traditionally been challenging for computers, for example in the fields
of natural language processing, computer vision, and recommender systems.
For large ML tasks, distributed training has become a necessity for keeping
up with increasing dataset sizes and model complexity. A key challenge in
distributed training is to synchronize model parameters among cluster nodes
and to do so efficiently. Parameter servers (PSs) facilitate the implementation
of distributed training by providing cluster-wide read and write access to the
parameters, and transparently handling partitioning and synchronization in
the background (either among the cluster nodes directly or via physically
separate server nodes).
In this thesis, we study the efficiency of PSs for ML tasks with sparse
parameter access, i.e., tasks in which each update step reads and writes only a
small (or tiny) part of the model. In a first step, we find that existing PSs are
inefficient for such tasks: in our experiments, distributed implementations
were slower than efficient single node implementations due to communication
overhead. This inefficiency dramatically limits the utility of PSs and
distributed training in general. With such inefficiency, distributed training in
practice will be used only when it is indispensable, e.g., for very large models.
And when distributed training is indeed employed, its inefficiency squanders
hardware and energy resources.
Starting from this observation, we investigate whether and how PS efficiency
can be improved. The main idea of this thesis is to increase efficiency
by making the PS adapt to the underlying ML task. We first present and
evaluate a series of potential performance improvements in this direction, each
making the PS more adaptive. In particular, we explore (i) dynamically
adapting the allocation of model parameters, i.e., relocating parameters among
nodes during training, according to where they are accessed, (ii) adapting the
management technique of the PS to the access patterns of individual parameters,
i.e., employing a suitable management technique for each parameter, and
(iii) adapting to the type of a parameter access, by supporting sampling (i.e.,
randomized) access directly. We find empirically that each of these aspects
can improve PS efficiency. However, each aspect also makes the PS more
complex to use, because the application (i.e., the component that interacts
with the PS) needs to control adaptivity manually.
To reduce complexity, we develop a mechanism that enables automatic
adaptivity, i.e., adaptivity without requiring the application’s manual control.
With this mechanism, the application merely provides information about
parameter accesses, in a way that naturally integrates into common ML systems.
We describe a novel PS system, called AdaPS, that adapts to ML tasks
automatically based on the information provided by this mechanism. AdaPS
incorporates all adaptivity aspects presented in this thesis. It decides what
to do (i.e., which management technique to use for a specific parameter and
where to allocate each parameter) and when to do so. It does so automatically,
i.e., without further user input, and dynamically, i.e., based on the current
situation. In our experiments, AdaPS enabled efficient distributed ML training
for multiple ML tasks: in contrast to previous PSs, it provided near-linear
speed-ups over efficient single node implementations.
With these results, we argue that PSs can be efficient for sparse ML
tasks, and that this efficiency can be reached with limited additional effort
from application developers. Efficient and easy-to-use PSs make distributed
training (i) attractive for a wide range of use cases (thus enabling solutions to
challenging problems) and (ii) squander fewer of our planet's resources.
Zusammenfassung
(German Abstract, in English translation)
Machine learning (ML) has become an essential tool for solving problems
that have traditionally been challenging for computers, for example in
natural language processing, image recognition, and the generation of
recommendations. To keep up with growing data volumes and model
complexity, complex ML tasks require distributed training, that is,
parallel training on multiple computers of a cluster. A central challenge
in distributed training is the efficient synchronization of model
parameters among the computers of the cluster. Parameter servers facilitate
the implementation of distributed training by providing cluster-wide read
and write access to the model parameters and transparently handling
partitioning and synchronization in the background (either directly among
the computers of the cluster or via physically separate computers).
In this thesis, I study the efficiency of parameter servers for ML tasks
with sparse parameter access, that is, tasks in which each model update
step reads and writes only a small (or tiny) part of the model. In a first
step, I find that existing parameter servers are inefficient for such
tasks: in my experiments, distributed implementations (with 8 computers)
were slower than efficient implementations on a single computer. This
inefficiency drastically limits the utility of parameter servers and of
distributed training. With such high inefficiency, distributed training is
used in practice only when it is unavoidable, for example for very large
models. And whenever distributed training is used, it wastes hardware and
energy resources.
Based on this observation, I investigate whether and how the efficiency of
parameter servers can be improved. The central idea of this thesis is to
improve efficiency by having the parameter server adapt to the underlying
ML task. I first present and evaluate a series of potential improvements in
this direction, each of which makes the parameter server more adaptive. In
particular, I explore (i) dynamically adapting the allocation of model
parameters, that is, relocating parameters among nodes during training
according to where they are accessed, (ii) adapting the parameter
management techniques of the parameter server to the access patterns of
individual parameters, that is, employing a suitable technique for each
parameter, and (iii) adapting the parameter server to the type of parameter
access by supporting sampling (randomized) access directly. In a series of
experiments, I find that each of these aspects can improve the efficiency
of parameter servers. However, each aspect also complicates the use of the
parameter server, because the application (the component that interacts
with the parameter server) has to control the adaptations manually.
For this reason, I develop a mechanism that enables automatic adaptation.
Through this mechanism, the application provides information about
parameter accesses in a way that integrates seamlessly into common ML
systems. In addition, I present a novel parameter server (AdaPS) that
adapts to ML tasks automatically, based on the information provided by this
mechanism. AdaPS combines all aspects of adaptivity presented in this
thesis. The parameter server decides what to do (that is, which technique
to use for a specific parameter and where to allocate each parameter) and
when to do it. It does so automatically, that is, without additional
information, and dynamically, that is, based on the current situation. In
experiments, AdaPS enables efficient distributed ML training: in contrast
to previous parameter servers, AdaPS provides near-linear speed-ups over
implementations for single computers.
Based on these results, I argue that parameter servers can be efficient for
sparse ML tasks, and that this efficiency can be reached with limited
additional effort for application developers. Efficient and easy-to-use
parameter servers make distributed training (i) attractive for a wide range
of use cases (thus enabling solutions to challenging problems) and (ii)
less wasteful of our planet's resources.
Acknowledgments
First of all, I would like to thank my thesis advisors, Volker Markl and
Rainer Gemulla. Thank you Volker for providing me with a great research
environment, for always trusting my skills, and for giving me the freedom
and helping me to find my own way of doing things. Thank you Rainer
for inspiring me to enter academia (by teaching an outstanding lecture on
database systems), for instilling in me a wealth of learnings about approaching,
thinking about, and solving problems, and for always providing me with
excellent feedback on my research, my writing, and my talks. I also want to
thank Matthias Böhm and Tilmann Rabl for being part of my committee
and for providing valuable feedback on this thesis. I want to thank Steffen
Zeuch and Zoi Kaoudi, for advising me in my research.
I want to thank all others who have been guiding and supporting me,
in particular in the Database Systems and Information Management group
at TU Berlin. Thanks to Jonas for always creating a friendly atmosphere in
our office, for teaching me how to navigate the research group, saving me
countless hours of figuring things out, and for many interesting discussions
on research, politics, and life. Thank you Claudia, Melanie, and Lutz, for
keeping the place running and for always being helpful. And thank you
to all the other PhD students for making the research group a friendly and
supportive place: Andreas, Gabor, Philipp, Haralampos, Kajetan, Sergey,
Viktor, Clemens, Behrouz, Felix, Ventura, Makis, Rudi, Lennart, Ariane,
Xenofon, Martin, Anastasiia, and Dimitrios. Thank you to Adrian Kochsiek
for an inspiring collaboration on training knowledge graph embeddings. And
thank you to the students whom I had the honor to supervise, in particular
Andreas, Robert, and Tobias. Thank you for many interesting discussions
and teaching me plenty about supervision.
None of this would have been possible without the constant support of
my family, my friends, and my partner (who doubles as a fantastic office mate
when a global pandemic forces you to work from home). Thank you for all
your support and for the great time that we are having together, through
life’s ups and downs. My life is better because of you.
Contents
1 Introduction
2 Background
2.1 Machine Learning
2.1.1 Basics
2.1.2 Stochastic Gradient Descent
2.1.3 Sparse Parameter Access
2.1.4 Parallel Stochastic Gradient Descent
2.2 Distributed Parameter Management
2.2.1 Classic Parameter Server
2.2.2 Static Full Replication
2.2.3 Selective Replication
2.3 Related Work
2.3.1 Update Compression
2.3.2 Higher Compute Intensity
2.3.3 Deep Learning
2.3.4 Other Task-Specific Approaches
2.3.5 Faster Interconnect Hardware
2.3.6 Decentralized Training
2.3.7 Programming Abstractions and Scheduling
2.3.8 Key–Value Stores
3 Exploiting Locality: Dynamic Parameter Allocation
3.1 The Case for Dynamic Parameter Allocation
3.1.1 PAL Techniques
3.1.2 Dynamic Parameter Allocation
3.2 The Lapse Parameter Server
3.2.1 Overview
3.2.2 Parameter Relocation
3.2.3 Parameter Access
3.2.4 Consistency
3.2.5 Location Management
3.2.6 Granularity of Location Management
3.2.7 Important Implementation Aspects
3.3 Experiments
3.3.1 Experimental Setup
3.3.2 Performance of Classic Parameter Servers
3.3.3 Effect of Dynamic Parameter Allocation
3.3.4 Comparison to Manual Management
3.3.5 Comparison to Replication PSs
3.3.6 Ablation Study
3.4 Related Work
3.4.1 Dynamic Parallelism
3.4.2 Dynamic Allocation in Key–Value Stores
3.5 Summary
4 Handling Diversity: Non-Uniform Parameter Management
4.1 Non-Uniform Parameter Access
4.1.1 Skew
4.1.2 Sampling
4.2 Multi-Technique Parameter Management
4.2.1 Analysis of Common Parameter Management Techniques
4.2.2 Parameter Management in NuPS
4.3 Sampling Management
4.3.1 Sampling Conformity Levels
4.3.2 Analysis of Common Sampling Schemes
4.3.3 A Primitive for Sampling
4.3.4 The Sampling Manager in NuPS
4.4 Experiments
4.4.1 Experimental Setup
4.4.2 Overall Performance
4.4.3 Ablation
4.4.4 Scalability
4.4.5 Effect of Sampling Schemes
4.4.6 Choice of Management Technique
4.4.7 Effect of Replica Staleness
4.4.8 Comparison to Task-Specific Implementations
4.5 Summary
5 Attaining Ease of Use: Automatic Adaptivity
5.1 Efficiency and Complexity of Existing Approaches
5.1.1 Static Full Replication
5.1.2 Classic PS
5.1.3 Replication PS
5.1.4 Relocation PS
5.1.5 Multi-technique PS
5.1.6 Summary
5.2 Intent Signaling
5.3 The AdaPS Parameter Server
5.3.1 Automatic Choice of Technique
5.3.2 Automatic Action Timing
5.3.3 Responsibility Follows Allocation
5.3.4 Efficient Communication
5.4 Experiments
5.4.1 Experimental Setup
5.4.2 Overall Performance
5.4.3 Scalability
5.4.4 Efficiency of Techniques
5.4.5 Effect of Action Timing
5.4.6 AdaPS in Action
5.5 Summary
6 Conclusions
List of Figures
List of Tables
List of Algorithms
Chapter 1
Introduction
Science is a way of trying not to fool yourself. The principle is that
you must not fool yourself, and you are the easiest person to fool.
Richard Feynman
Machine learning (ML) studies algorithms that “learn” to carry out
certain tasks from example data. ML algorithms provide state-of-the-art
performance in multiple domain areas, sometimes drastically outperforming
traditional (non-ML) approaches, for example in natural language processing
(Bengio et al. 2000; Hochreiter and Schmidhuber 1997; Sutskever et al.
2014; Mikolov et al. 2013; Devlin et al. 2019; Brown et al. 2020), computer
vision (LeCun et al. 1989; Krizhevsky et al. 2012; Simonyan and Zisserman
2015; Ramesh et al. 2021; Riquelme et al. 2021), and recommender systems
(Koren et al. 2009; Das et al. 2007; P.-L. Chen et al. 2012). During training, the
parameters of one specific ML model are adjusted from example training data
in iterative update steps. Model training can be time-consuming because (i)
models usually become better the more training data they have been trained
on, such that there is a large number of such update steps, and (ii) larger
models commonly achieve better model quality, but typically require more
complex (i.e., more time-consuming) individual update steps.
Key to keeping up with increasing dataset sizes and model complexity is
distributed model training, i.e., using multiple nodes of a compute cluster
to parallelize training. Distributed training allows (i) for training faster by
leveraging distributed compute resources, (ii) for handling models and datasets
that exceed the memory capacity of a single node, and (iii) for potentially
using a larger number of cheaper nodes rather than one expensive high-end
node. During training, each node typically accesses a local partition of the
training data, but requires global read and write access to all model parameters.
Thus, parameter management among these nodes is a key concern in
distributed training. Parameter servers (PSs) facilitate the implementation of
distributed training by providing such distributed parameter management:
they transparently partition and synchronize parameters across the nodes
behind primitives for reading and writing parameters. Most ML systems
include such a PS as a core component (Abadi et al. 2016; Paszke et al. 2019;
T. Chen et al. 2015; Lerer et al. 2019; J. K. Kim et al. 2016; Chilimbi et al.
2014), and there are many standalone PSs (Mu Li et al. 2014a; Ho et al. 2013;
Dai et al. 2015; Sergeev and Balso 2018; J. Jiang et al. 2017; Yuzhen Huang
et al. 2018; Jagerman et al. 2017; Z. Zhang et al. 2019; Y. Jiang et al. 2020).
Early PSs stored the parameters on a set of physically separate server nodes
(Smola and Narayanamurthy 2010; Ahmed et al. 2012). In this thesis, we use
the PS term broadly to refer to systems that provide distributed parameter
management, including ones that co-locate the parameters on the nodes that
run the training (Ho et al. 2013; Dai et al. 2015; Yuzhen Huang et al. 2018).
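To make the idea of read and write primitives concrete, the following is a minimal single-process sketch of what an application sees when it talks to a PS. It is not the interface of any system discussed in this thesis: the class name `ToyParameterServer`, the method names `pull` and `push`, and the modulo-based partitioning are all assumptions made for illustration.

```python
import numpy as np


class ToyParameterServer:
    """Single-process stand-in for a PS (hypothetical, for illustration only).

    Parameters are partitioned across `num_nodes` simulated nodes by key.
    The application only calls pull() and push(); it never deals with
    the partitioning itself.
    """

    def __init__(self, num_keys, value_length, num_nodes):
        self.num_nodes = num_nodes
        # One key -> value store per simulated node.
        self.stores = [dict() for _ in range(num_nodes)]
        for key in range(num_keys):
            self.stores[self._node_of(key)][key] = np.zeros(value_length)

    def _node_of(self, key):
        # Assumed static partitioning: key modulo number of nodes.
        return key % self.num_nodes

    def pull(self, keys):
        """Read the current values of the given keys."""
        return np.stack([self.stores[self._node_of(k)][k].copy() for k in keys])

    def push(self, keys, updates):
        """Add the given updates to the values of the given keys."""
        for k, u in zip(keys, updates):
            self.stores[self._node_of(k)][k] += u


ps = ToyParameterServer(num_keys=10, value_length=2, num_nodes=4)
ps.push([3, 7], np.array([[1.0, 1.0], [0.5, -0.5]]))
values = ps.pull([3, 7, 8])
```

In a distributed setting, the per-node stores would live on different machines, so a read or write of a key held by another node would require network communication.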
In many ML tasks, the model is large, but accessed sparsely, i.e., each
update step reads and writes only a (small) subset of a large number of model
parameters. For example, such sparse access is common in natural language
processing tasks (Mikolov et al. 2013; Pennington et al. 2014; Peters et al. 2018;
Howard and Ruder 2018), knowledge graph embeddings (Trouillon et al. 2016;
Nickel et al. 2011; Balazevic et al. 2019; Bordes et al. 2013; Kazemi and Poole
2018), some graph neural networks (Schlichtkrull et al. 2018; Shang et al.
2019; Vashishth et al. 2020), click-through prediction (H.-T. Cheng et al. 2016;
Guo et al. 2017; R. Wang et al. 2017; Zhou et al. 2018), and recommender
systems (Koren et al. 2009; P.-L. Chen et al. 2012; Hu et al. 2008). A key
property of sparse ML tasks is that their parameter access pattern is dynamic,
i.e., each node accesses different parameters at different points in time.
In this thesis, we study the efficiency of PSs for ML tasks with such sparse
parameter access. In a first step, we find that existing PSs are inefficient: they
barely outperform efficient single node implementations. The reason for this
underwhelming performance is communication overhead for synchronizing
model parameters among the cluster nodes. Based on this observation, we
work on making PSs more efficient. To this end, we present and evaluate
several potential PS performance improvements. The common theme among
these potential improvements is adaptivity, i.e., that the PS adapts to the
underlying ML task. Building on each other, these improvements step by
step make the PS more and more adaptive, and—based on what we see in our
empirical evaluations—more efficient. We first introduce these adaptivity
aspects as PS features that are manually controlled by the application (i.e., the
component that interacts with the PS). Therefore, as we make the PS more
and more adaptive, we also make it more and more complex to use. In a final
step, we drastically reduce usage complexity by presenting an approach for
PSs to adapt automatically, without the application’s manual control.
Contributions
In more detail, our main contributions are:
1. We investigate the efficiency of existing PSs. We find that existing PSs
   are scalable, but inefficient. Their performance can even fall behind
   that of efficient single node implementations. For example, in several
   experiments (see Section 3.3.2), 8 nodes were slower than a single node.
   We argue that it is crucial to compare PS performance to efficient single
   node baselines to detect and quantify such inefficiency.
2. We investigate whether dynamically adapting parameter allocation
   to the ML task (i.e., relocating model parameters during run time)
   can improve PS efficiency. We find that dynamic allocation allows
   for exploiting locality in parameter access, and drastically improves
   efficiency for some ML tasks by reducing communication overhead.
3. We investigate whether adapting the management techniques of the PS
   to the access patterns of individual parameters can improve PS efficiency.
   That is, the PS picks a suitable management technique per parameter. We
   find that such technique adaptation improves PS efficiency for ML tasks
   with non-uniform (i.e., skewed) access frequency distributions.
4. We investigate whether supporting sampling (i.e., randomized) parameter
   access directly in PSs can improve efficiency. We find that direct
   sampling support significantly improves PS efficiency for a range of ML
   tasks, in particular tasks that employ negative sampling for many-class
   classification or to mitigate an absence of negative training data.
5. We propose intent signaling, a novel mechanism for passing parameter
   access information from the application to the PS in a way that
   integrates naturally with common ML systems. Intent signaling allows PSs
   to adapt to ML tasks automatically, thus relieving applications from
   the need to control adaptation manually.
6. We present and evaluate a fully adaptive, zero-tuning PS called AdaPS.
   AdaPS dynamically adapts its management techniques and parameter
   allocation to the underlying ML task. It does so automatically, based
   only on intent signals. In our experiments, AdaPS was efficient out of
   the box (i.e., with zero tuning), providing near-linear speed-ups over
   efficient single node implementations.
Throughout this thesis, we present three prototype PS systems: Lapse,
NuPS, and finally AdaPS. Each builds on the previous one(s), and explores
and evaluates additional aspects of adaptivity. AdaPS is the final system that
incorporates all the ideas of this thesis. All of our work is publicly available
as open-source software.1
With the above results, we argue that distributed training for sparse ML
tasks can be efficient, and that this efficiency can be reached with limited
additional effort from application developers. Using inefficient PSs makes
distributed training uneconomical for all ML tasks that can be trained with
single node implementations. Consequently, with inefficient PSs, distributed
training is employed in practice mostly when its scalability is indispensable
due to very large datasets or models. The adaptations presented in this
thesis drastically improve PS efficiency for sparse ML tasks, reaching near-
linear speed-ups over efficient single node implementations. This makes
distributed training attractive for many more ML use cases. In addition,
automatic adaptation makes these efficiency gains accessible without expertise
in distributed systems and without extensive configuration and tuning effort.
Together, these contributions make efficient distributed training of sparse
ML tasks attractive and accessible for a wide range of ML use cases.
Publications
The work presented in this thesis is based on the following publications:
A. Renz-Wieland, R. Gemulla, S. Zeuch, V. Markl. Dynamic Parameter
Allocation in Parameter Servers. PVLDB, 13(11): 1877–1890. 2020.
(Renz-Wieland et al. 2020)
A. Renz-Wieland, R. Gemulla, Z. Kaoudi, V. Markl. NuPS: A Parameter
Server for Machine Learning with Non-Uniform Parameter Access.
In Proceedings of the 2022 International Conference on Management of
Data. ACM, New York, NY, USA. 2022. (Renz-Wieland et al. 2022a)
A. Renz-Wieland, A. Kieslinger, R. Gericke, R. Gemulla, Z. Kaoudi,
V. Markl. Good Intentions: Adaptive Parameter Servers via Intent
Signaling. CoRR, abs/2206.00470. 2022. (Renz-Wieland et al. 2022b)
The thesis also draws material from the following publication:
A. Renz-Wieland, T. Drobisch, R. Gemulla, S. Zeuch, V. Markl. Just
Move It! Dynamic Parameter Allocation in Action. PVLDB, 14(12):
2707–2710. 2021. (Renz-Wieland et al. 2021)
1 Available at https://github.com/alexrenz/AdaPS.
Outline
Chapter 2 introduces basic concepts. Chapter 3 investigates the efficiency of
existing PSs and describes dynamic parameter allocation. In Chapter 4, we
present our work on adapting management techniques and direct sampling
support. Chapter 5 describes automatic adaptivity, i.e., intent signaling and
AdaPS. Chapter 6 concludes this thesis with a summary and a discussion of
open research problems.
Chapter 2
Background
In this chapter, we introduce concepts that are crucial for understanding this
thesis (Section 2.1), and give an overview of existing work on distributed
parameter management (Section 2.2) and related work in general (Section 2.3).
2.1 Machine Learning
2.1.1 Basics
ML studies algorithms that “learn” to carry out certain tasks from example
data (Mitchell 1997). This stands in contrast to conventional algorithms, in
which software developers explicitly encode how to carry out a task. ML
algorithms are given a set of training examples, commonly referred to as
the training data. The ML algorithm uses these training data to construct a
model. This process is referred to as model training. The goal is that—after
training—the model can be applied to examples that are not included in the
training data. This step is referred to as inference. This thesis focuses on
model training.
ML models store their learned information in model parameters, typically
one or multiple vectors, matrices, or tensors of real numbers. The goal of
model training is to adjust these model parameters such that the model can
carry out the desired task. This training is typically done iteratively, i.e.,
there are many consecutive update steps that adjust the model parameters.
2.1.2 Stochastic Gradient Descent
Currently, the most widely used training algorithms are gradient-based ones
(Ruder 2016). These algorithms aim to minimize a given cost function. During
training, these algorithms iteratively adjust the model parameters based on
the gradient with respect to this cost function. Gradient descent updates the
model parameters based on the exact gradient. To compute this gradient, the
model is evaluated on all examples of the training dataset.
For large datasets, stochastic gradient descent (SGD) and mini-batch SGD
are popular because they can learn faster than full gradient descent (Bottou
et al. 2016). SGD learning algorithms use a stochastic approximation of
the exact gradient to update parameters. In SGD, this approximation is
computed from one single example of the training dataset. In mini-batch
SGD, this approximation is computed from a set—a mini-batch—of examples.
SGD methods commonly learn faster than gradient descent on large datasets
because they can do many—approximate, but cheap—update steps in the time
that it takes to do one gradient descent update step.
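The difference between the three gradient computations can be illustrated on a toy problem. The sketch below assumes a least-squares cost and synthetic data, neither of which comes from this thesis; it only shows which examples each variant evaluates and why the cheaper variants are approximations of the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (an assumption for illustration):
# cost(w) = mean over examples of 0.5 * (x_i . w - y_i)^2
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
w = np.zeros(5)


def gradient(w, X_subset, y_subset):
    """Average gradient of the cost over the given examples."""
    residual = X_subset @ w - y_subset
    return X_subset.T @ residual / len(y_subset)


# Gradient descent: exact gradient, evaluates the model on all examples.
exact = gradient(w, X, y)

# SGD: approximation computed from one single example.
i = rng.integers(len(y))
sgd_estimate = gradient(w, X[i:i + 1], y[i:i + 1])

# Mini-batch SGD: approximation computed from a mini-batch of examples.
batch = rng.choice(len(y), size=32, replace=False)
minibatch_estimate = gradient(w, X[batch], y[batch])

# Averaging the single-example gradients over all examples recovers the
# exact gradient, which is why each estimate is a stochastic approximation.
per_example = np.stack(
    [gradient(w, X[j:j + 1], y[j:j + 1]) for j in range(len(y))]
)
```

Computing `exact` touches all 1000 examples, whereas `sgd_estimate` touches one and `minibatch_estimate` touches 32, which is the source of the cost difference per update step.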
One update step of a gradient-based training algorithm has the following
access pattern. It reads the relevant training examples, reads the parameters
that it requires to compute the gradient for these training examples, computes
the gradient, and writes updates to the parameters. Usually, training involves
multiple passes over the training dataset. One pass over the training dataset
is commonly referred to as one epoch.
Algorithm 2.1: Sequential mini-batch SGD.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      W: model parameters
1 for epoch ← 1 to num_epochs do
2     b = num_batches(D, batch_size)
      // data loading (pipeline parallel with training, in separate thread(s))
3     B = []
4     for i ← 1 to b do
5         B_i = prepare_batch(i, D, batch_size, epoch)
      // training
6     for i ← 1 to b do
7         ΔW = compute_update(B_i, W)
8         W ← W + ΔW
Algorithm 2.1 depicts a simplified example implementation of mini-batch
SGD. (SGD can be seen as a special case of mini-batch SGD with a batch size
of 1.) The algorithm runs multiple epochs (line 1). In each epoch, a
worker thread iterates over the training data in a sequence of batches. Each
batch is prepared before it is trained on (lines 3–5), in pipeline-parallel
fashion (see below for details). For example, the preparation could consist of
reading and annotating sentences, or loading and cropping images. The worker
thread iterates through the batches (line 6). It computes an update from each
batch, using the current model parameters (line 7), and applies this update to
the model parameters (line 8).
Figure 2.1: Pipeline parallel preparation and training of batches. (Timeline
with two rows over time: the data loader thread(s) prepare batches 1, 2, 3, 4,
5, ...; the worker thread trains on batches 1, 2, 3, 4, 5, ...)
We depict batch preparation (lines 3–5) as a separate loop because it is
commonly run pipeline-parallel with the actual training, in one or multiple
separate data loader threads (Paszke et al. 2019; Abadi et al. 2016; T. Chen et al.
2015). The data loader threads prepare each batch shortly before it is trained
on by the worker thread. Figure 2.1 illustrates this pipeline parallelism.
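The following is a minimal Python sketch of sequential mini-batch SGD with pipeline-parallel batch preparation. It is an illustration under assumptions, not the implementation used in this thesis: the toy model y = w * x, the squared-error loss, the learning rate, and the bounded queue between loader and worker are all chosen for the example. Only the names prepare_batch and compute_update mirror Algorithm 2.1.

```python
import queue
import random
import threading


def prepare_batch(i, data, batch_size):
    """Data loading step: here simply slice out the i-th mini-batch."""
    return data[i * batch_size:(i + 1) * batch_size]


def compute_update(batch, w, learning_rate=0.1):
    """Update term for the assumed toy model y = w * x under squared error."""
    gradient = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return -learning_rate * gradient


def train(data, num_epochs, batch_size, w=0.0):
    num_batches = len(data) // batch_size
    for _epoch in range(num_epochs):
        # Bounded buffer: the loader stays at most two batches ahead.
        prepared = queue.Queue(maxsize=2)

        def load_batches():
            for i in range(num_batches):
                prepared.put(prepare_batch(i, data, batch_size))

        loader = threading.Thread(target=load_batches)
        loader.start()
        for _ in range(num_batches):
            batch = prepared.get()            # blocks until the batch is ready
            w = w + compute_update(batch, w)  # apply the update term
        loader.join()
    return w


random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(256))]
w_final = train(data, num_epochs=20, batch_size=16)
print(w_final)  # close to the true slope 3.0
```

The bounded queue is one possible way to let the loader run slightly ahead of training without preparing the whole epoch in advance.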
2.1.3 Sparse Parameter Access
This thesis focuses on ML tasks in which parameter access is sparse, i.e.,
each update step reads and writes only a (small, and potentially tiny) subset
of all model parameters. For example, such sparse access is common in
natural language processing tasks (Mikolov et al. 2013; Pennington et al. 2014;
Peters et al. 2018; Devlin et al. 2019; Howard and Ruder 2018), knowledge
graph embeddings (Trouillon et al. 2016; Nickel et al. 2011; Balazevic et
al. 2019; Bordes et al. 2013; Kazemi and Poole 2018), some graph neural
networks (Schlichtkrull et al. 2018; Shang et al. 2019; Vashishth et al. 2020),
click-through prediction (H.-T. Cheng et al. 2016; Guo et al. 2017; R. Wang
et al. 2017; Zhou et al. 2018), and recommender systems (Koren et al. 2009;
P.-L. Chen et al. 2012; Hu et al. 2008). Parameter access is sparse in these
tasks because the models’ predictions for one training example depend only
on a subset of the model parameters. For example, consider an ML model
that associates one parameter with each word of the English language. A
training example is one sentence. In each sentence, only a fraction of all
English words occur. Thus, only a fraction of parameters is accessed for one
training example. Figure 2.2 illustrates this example.
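The word-per-parameter example can be sketched in a few lines of Python. The twelve-word vocabulary, the integer key assignment, and the helper accessed_keys are assumptions made for illustration only.

```python
# Assumed toy vocabulary; each word owns exactly one parameter (one key).
vocabulary = ["planet", "disaster", "melting", "level", "model", "the",
              "climate", "associate", "glacier", "degree", "is", "crisis"]
key_of = {word: key for key, word in enumerate(vocabulary)}


def accessed_keys(sentence):
    """Keys of the parameters that one training example (a sentence) touches."""
    words = sentence.lower().replace(".", "").split()
    return sorted({key_of[w] for w in words if w in key_of})


keys = accessed_keys("The glacier is melting.")
share = len(keys) / len(vocabulary)
print(keys, share)
```

With a realistic vocabulary of many thousands of words, the accessed share per sentence would be far smaller than in this toy setting.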
In more detail, the existence and extent of sparsity depend on the ML
model, the training dataset, and the cost function. Parameter access is sparse
if (i) the model associates specific model parameters with specific properties
of the training examples, (ii) the training data are sparse, i.e., each training
example includes only a subset of all properties, and (iii) there is no update for
parameters that are associated with non-present properties (typically because
the gradient for these parameters is zero).
Chapter 2. Background
[Figure 2.2 (diagram): a row of squares, one per model parameter, labeled with words (planet, disaster, melting, level, model, the, climate, associate, glacier, degree, is, crisis, ...); for the training example «The glacier is melting.», the legend distinguishes parameters that are accessed from parameters that are not accessed.]
Figure 2.2: An example for sparse parameter access. Each square represents
one model parameter. The example ML model associates a few parameters with
each word of the English language. One training example, i.e., one sentence of
a text corpus, accesses only the parameters that are associated with the words
that occur in the sentence.
The opposite of sparse access is dense parameter access, i.e., each update
step reads and writes all model parameters. For example, dense access is
common in computer vision (LeCun et al. 1989; Ramesh et al. 2021). For
instance, in convolutional neural networks (LeCun et al. 1989), all parameters
are required to classify a training example (i.e., one image), so that each update
step accesses all parameters, see Figure 2.3.
In some models, parameter access is partially dense and partially sparse
(Peters et al. 2018; Devlin et al. 2019; Howard and Ruder 2018). Typically,
access to the first (embedding) layer and sometimes the last (classification)
layer is sparse, and access to other layers is dense. The share of parameters
that are accessed sparsely depends on the model architecture, but can be high,
e.g., around 90% in ELMo (Peters et al. 2018).
2.1.4 Parallel Stochastic Gradient Descent
SGD is a sequential algorithm: an update step depends on the updates done by
all previous update steps. In practice, two approaches are used to parallelize
SGD methods. The first one, synchronous (parallel) SGD (Zinkevich et al.
2010; J. Chen et al. 2016) parallelizes the computation of one mini-batch.
[Figure 2.3 (diagram): model parameters, one training example, and the resulting parameter access; legend: parameter not accessed / parameter accessed.]
Figure 2.3: An example for dense parameter access. All model parameters are
required to classify one training example (e.g., one image).
I.e., the gradients for the examples of a mini-batch are computed by different
workers in parallel. The gradients are then aggregated among the workers
and the update is applied to the parameters. Synchronous SGD is equivalent
to sequential execution. However, it requires a global barrier at the end of
each update step to wait until all participating workers have finished gradient
computation. This can be inefficient for ML tasks in which gradient compu-
tation is relatively cheap (Niu et al. 2011). The overhead of the barrier can be
reduced to some extent by launching additional backup workers to compute
gradients for more training examples than needed and then using only the
gradients of the fastest workers (J. Chen et al. 2016). This method decreases
overall efficiency, as additional gradients are computed, but ignored.
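The equivalence of synchronous SGD to sequential execution can be checked on a small example. The sketch below assumes a toy model y = w * x with squared error and a strided split of the mini-batch among two workers; the workers are simulated by a loop, whereas a real system would compute the partial gradients in parallel and aggregate them after a barrier.

```python
def gradient_sum(examples, w):
    """Sum of per-example gradients for the assumed toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in examples)


batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5
num_workers = 2

# Each (simulated) worker receives a disjoint share of the mini-batch.
shards = [batch[k::num_workers] for k in range(num_workers)]
partial_sums = [gradient_sum(shard, w) for shard in shards]

# Aggregation step: combine the partial results into one mini-batch gradient.
aggregated = sum(partial_sums) / len(batch)
sequential = gradient_sum(batch, w) / len(batch)
print(aggregated, sequential)
```

Both values agree because the mini-batch gradient is a sum over examples, so the order and grouping of the summation do not change the result (up to floating-point rounding in general).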
In the second approach, asynchronous SGD (Bengio et al. 2000; Niu et al.
2011; Dean et al. 2012), workers carry out update steps independent of each
other, potentially reading stale parameter values and potentially overwriting
updates by other workers. Asynchronous SGD is not equivalent to sequential
execution. Nevertheless, asynchronous SGD is popular, in particular for
tasks with relatively cheap update steps, because it does not require a barrier
after each update step. It is especially suitable for ML tasks with sparse updates
because updates are overwritten less frequently (Niu et al. 2011).
Algorithm 2.2 depicts an example shared memory implementation of
asynchronous parallel SGD. Multiple worker threads execute the depicted
program in parallel. The worker threads access the model parameters W
via shared memory. In contrast to the sequential implementation (see
Algorithm 2.1), the set of all batches is divided up among all worker threads,
such that each batch is processed only once. The batches are assigned to
workers based on an identifier for the worker thread (t) and the total number
of worker threads T (lines 2 and 5).
Algorithm 2.2: Asynchronous parallel mini-batch SGD. The program is run by
multiple worker threads in parallel. Differences to a sequential
implementation (Algorithm 2.1) are highlighted in blue.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      W: model parameters (shared memory),
      t: the ID of this worker thread,
      T: the number of total worker threads
1 for epoch ← 1 to num_epochs do
2     b = num_batches(D, batch_size, t, T)
      // data loading (pipeline parallel with training, in separate thread(s))
3     B = []
4     for i ← 1 to b do
5         B_i = prepare_batch(i, D, batch_size, epoch, t, T)
      // training
6     for i ← 1 to b do
7         ΔW = compute_update(B_i, W)
8         W ← W + ΔW
The worker threads write updates to the model parameters W concurrently.
There are implementations of asynchronous SGD that use no
concurrency control for these updates (Mikolov et al. 2013), such that
updates can be overwritten. Other implementations use locking mechanisms or
atomics to ensure that there are no lost updates (Niu et al. 2011). For simpler
notation, we depict that the update step reads (line 7) and writes (line 8)
all model parameters W. In practice, an update step for an ML task with
sparse parameter access reads and writes only a subset of all model parameters.
I.e., the parameter update ΔW consists mostly of zeros. The frequency of
concurrent updates decreases with increasing sparsity (Niu et al. 2011).
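A minimal Python sketch of the locking variant follows. Several assumptions are made for illustration: batches are assigned to workers in a strided fashion (worker t takes batches t, t+T, t+2T, ...), each batch touches two parameters, and the update term is a constant stand-in for compute_update so that lost updates would be directly visible in the final sum.

```python
import threading


def run_worker(t, T, batches, W, lock):
    # Assumed assignment: worker t processes batches t, t+T, t+2T, ...
    for i in range(t, len(batches), T):
        keys = batches[i]                  # parameters touched by this batch
        _values = [W[k] for k in keys]     # read; values may already be stale
        delta = {k: 1.0 for k in keys}     # stand-in for compute_update
        with lock:                         # the lock prevents lost updates
            for k, d in delta.items():
                W[k] += d


num_params, T = 100, 4
# Sparse access: every batch touches two distinct parameters out of 100.
batches = [[(7 * i) % num_params, (13 * i + 1) % num_params]
           for i in range(400)]
W = [0.0] * num_params
lock = threading.Lock()
threads = [threading.Thread(target=run_worker, args=(t, T, batches, W, lock))
           for t in range(T)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(sum(W))  # every one of the 800 update terms was applied
```

A single global lock is the simplest choice for a sketch; finer-grained locks or atomics would reduce contention, and dropping concurrency control entirely corresponds to the variant in which updates can be overwritten.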
The program in Algorithm 2.2 is an implementation of asynchronous
parallel SGD. Analogously, synchronous SGD could be implemented. The
main difference is that in synchronous SGD workers wait after each update
step for all other workers to complete the update step. I.e., there would be a
barrier after each iteration of the training loop.
2.2 Distributed Parameter Management
To keep up with increasing training data size and model complexity, model
training is often not only parallelized among multiple compute units (e.g.,
CPU cores, or hardware accelerators such as GPUs or TPUs) of one node,
but distributed to the compute units of multiple nodes of a cluster. The
main advantages of distributed training are that (i) one can potentially train
models that exceed the memory capacity of a single node and (ii) training
can potentially be accelerated, as the compute units of several computers can
work in parallel.
A common approach to distributed training is the data parallel one:
the training dataset is partitioned to the nodes of the cluster, i.e., each node
holds one partition of the training dataset. Each node accesses only its local
partition of the training data, but potentially reads and updates all model
parameters. Thus, distributed parameter management, i.e., providing global
read and write access to parameters and handling synchronization among the
nodes, is a key concern.
The term parameter server is used inconsistently in current literature
to refer to systems that provide such distributed parameter management.
Authors use it to refer to at least two different concepts. First, the term
is used to refer to architectures in which gradients and/or parameters are
communicated exclusively via physically separate, dedicated server nodes,
rather than by direct communication among the worker nodes. Second, the
term is used in a broader sense to describe the class of all systems that provide
distributed parameter management, i.e., that provide global parameter access
across a cluster. This broader definition includes systems that achieve this via
separate server nodes and systems that achieve this via direct communication
among the nodes. Throughout this thesis, we will use the second meaning.
I.e., a PS is any system that provides global parameter access across a
cluster. To this end, it provides primitives to read and write parameters
(commonly referred to as pull and push, respectively). The read and write
operations can be performed synchronously or asynchronously. Behind the
scenes, the PS transparently handles parameter management: for example,
some PSs partition parameters to nodes and, if necessary, send appropriate
messages to process read and write operations. To coordinate parameter
accesses across nodes, the PS assigns unique keys to parameters. The push
operation is usually cumulative, i.e., the client sends an update term to the
PS, which then adds this term to the parameter value. Many ML stacks use
PSs as a component, e.g., TensorFlow (Abadi et al. 2016), MXNet (T. Chen
et al. 2015), PyTorch BigGraph (Lerer et al. 2019), STRADS (J. K. Kim et al.
2016), STRADS-AP (J. K. Kim et al. 2019), or Project Adam (Chilimbi et al.
2014), and there exist multiple standalone PSs, e.g., Petuum (Ho et al. 2013),
PS-Lite (Mu Li et al. 2014a), Angel (J. Jiang et al. 2017), FlexPS (Yuzhen
Huang et al. 2018), Glint (Jagerman et al. 2017), PS2 (Z. Zhang et al. 2019),
and BytePS (Y. Jiang et al. 2020).
[Figure 2.4 (diagram): three nodes (node 1, node 2, node 3); each node runs worker 1, worker 2, ..., worker n and helper thread(s); legend for parameter allocation: allocated at this node / not locally available.]
Figure 2.4: The distributed architecture that we assume in this thesis: multiple
network-connected nodes, each holding a partition of the training dataset,
several worker threads, potentially helper threads, and some model
parameters. The parameter allocation depends on the chosen PS; the figure
depicts, as an example, the allocation of a classic PS.
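The pull and push primitives with a cumulative push can be sketched as follows. The class InMemoryParameterStore is a hypothetical single-process stand-in without any networking or partitioning; it does not reproduce the API of any of the systems listed above.

```python
class InMemoryParameterStore:
    """Hypothetical single-process stand-in for a PS (no networking)."""

    def __init__(self):
        self._values = {}  # key -> parameter value

    def pull(self, keys):
        # Read: return the current value for each key (0.0 if never written).
        return [self._values.get(k, 0.0) for k in keys]

    def push(self, keys, updates):
        # Cumulative write: add each update term to the stored value.
        for k, u in zip(keys, updates):
            self._values[k] = self._values.get(k, 0.0) + u


ps = InMemoryParameterStore()
ps.push([3, 8], [0.5, -1.0])
ps.push([8], [0.25])          # accumulates onto the earlier update for key 8
values = ps.pull([3, 8, 42])
print(values)
```

Because push adds an update term instead of overwriting the value, updates from different workers to the same key combine regardless of the order in which they arrive.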
Throughout this thesis, we assume the following distributed architecture
(see Figure 2.4). There are multiple cluster nodes, connected by network links.
Each node holds one partition of the training dataset. On each node, there is
one or multiple worker threads. Additionally, there can be one or multiple
helper threads at each node, e.g., for synchronizing parameters among nodes
in the background or for processing parameter access operations of other
nodes. We further assume that each of these nodes holds some of the model
parameters, either through (exclusive) allocation or through replication. Such
co-location of the model parameters on the same physical nodes as the worker
threads is commonly done in recent PSs for efficiency (Jagerman et al. 2017;
J. Jiang et al. 2017; Ho et al. 2013; Yuzhen Huang et al. 2018).² Which node
holds which parameters depends on the approach for distributed parameter
management. We will discuss different approaches in the following. Figure 2.4
depicts the approach of a Classic PS: parameters are partitioned to the nodes
(see Section 2.2.1 for details).
Access to local parameters is cheaper than access to remote parameters.
For example, local parameters could be accessed via shared memory (see
Section 3.2.3). In contrast, remote access is more expensive because the
parameter value (or the parameter update) needs to be transferred over the
network between the node that accesses the parameter and a node that holds
the parameter (or a replica). The cost of remote parameter access is visible
throughout the experiments in this thesis, for example in Section 3.3.2.
² In contrast, early PSs stored the parameters on physically separate server nodes (Smola
and Narayanamurthy 2010; Ahmed et al. 2012).
[Figure 2.5 (diagram): parameter allocation at node 1, node 2, and node 3 at three points in time (initial, early, late); legend: allocated at node / replicated at node / not available locally.]
Figure 2.5: An example for parameter allocation in a classic PS. The parameters
are partitioned to the nodes. Parameter allocation is static, i.e., it does not
change throughout training.
In the following, we discuss fundamental approaches for providing global
parameter access in such a distributed architecture.
2.2.1 Classic Parameter Server
A classic PS (Smola and Narayanamurthy 2010; Ahmed et al. 2012; Mu Li et al.
2014a) such as PS-Lite partitions the model parameters to the nodes of the
cluster. Thus, precisely one node holds the current value of a parameter, and
no replicas are created. To read a parameter, a classic PS sends messages
over the network to retrieve the parameter value from the node that holds
this parameter. Analogously, it sends the update to the corresponding node
for a write operation. Figure 2.5 depicts an example for parameter allocation
in a classic PS. The example depicts parameter allocation for a three-node
cluster at three different points in time: (i) initially, before training starts,
(ii) at some early time of training, and (iii) at some late time of training. For
a classic PS, parameter allocation is the same at these three points in
time. In other words, parameter allocation in a classic PS is static. We will
see non-static approaches below and throughout this thesis.
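Static partitioning can be illustrated with a small Python sketch. The modulo rule in owner is an assumption chosen for brevity (range or hash partitioning would serve equally well), and the helper names are hypothetical; the sketch only shows which accesses of one node would require network communication.

```python
def owner(key, num_nodes):
    # Assumed static partitioning rule: modulo on the key.
    return key % num_nodes


def split_by_owner(keys, num_nodes):
    """Group keys by the node that holds them (one message per group)."""
    groups = {}
    for k in keys:
        groups.setdefault(owner(k, num_nodes), []).append(k)
    return groups


def count_remote(keys, local_node, num_nodes):
    """Number of accessed keys that are not allocated at the local node."""
    return sum(1 for k in keys if owner(k, num_nodes) != local_node)


groups = split_by_owner([2, 5, 8, 10], num_nodes=3)
remote = count_remote([2, 5, 8, 10], local_node=2, num_nodes=3)
print(groups, remote)
```

Since the mapping from key to node never changes, every node can compute the owner of a key locally, without asking other nodes where a parameter currently resides.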
One advantage of Classic PSs is that they can provide sequential consis-
tency (Lamport 1979) (see Section 3.2.4 for our analysis). That is, (1) each
worker’s operations are executed in the order specified by the worker, and
(2) the result of any execution is equivalent to an execution of the operations
of all workers in some sequential order. This consistency guarantee ensures
that classic PSs have no or only small negative effect on the convergence of
distributed training compared to sequential implementations. In contrast,
more relaxed consistency can slow down convergence (Ho et al. 2013; Dai
et al. 2015). A second advantage of classic PSs is that there is no need to
determine when and how often replicas should be synchronized (because
there are no parameter replicas). This makes classic PSs relatively easy to use.
A third advantage is that the maximum possible model size scales linearly
with the number of nodes.
The main disadvantage of a classic PS is that its performance can be very
limited because most parameter accesses induce access latency for one network
round trip. For example, to read a parameter, the accessing node contacts
the node that holds the parameter, which then transfers the parameter value
to the accessing node. This overhead can slow down distributed training
dramatically. For example, in our experiments in Section 3.3.2, distributed
implementations with a classic PS on 8 nodes were up to 22x slower than an
efficient single node implementation.
Let us now consider how an application (i.e., the component that in-
teracts with the PS) can leverage a classic PS to implement distributed train-
ing. Algorithm 2.3 depicts an example implementation of distributed asyn-
chronous mini-batch SGD in a classic PS. As in parallel SGD (see Algo-
rithm 2.2), multiple workers run the depicted program in parallel. In the
distributed setup, these workers are spread across physically separate cluster
nodes. The workers access the parameters via the pull and push primitives
of the PS. For each batch, the worker reads the subset of parameters that is
required to process this batch (line 7), computes an update (line 8), and writes
the computed update back into the PS (line 9). We write w for the subset
of model parameters that is required to process a batch (i.e., w ⊆ W) and
keys(B_i) for an application-specific function that returns the keys of the
parameters that are required to process a given batch B_i.
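The pull, compute, push loop described above can be sketched in Python as follows. The class ToyPS is a hypothetical in-process stand-in for a classic PS, the objective in compute_update is an assumed toy task (each example pulls one parameter towards a target value), and only a single worker is run; in a distributed setup, many workers on different nodes would execute run_worker concurrently against the same PS.

```python
class ToyPS:
    """Hypothetical in-process stand-in for a classic PS (no networking)."""

    def __init__(self):
        self.values = {}

    def pull(self, keys):
        return {k: self.values.get(k, 0.0) for k in keys}

    def push(self, keys, delta):
        for k in keys:  # cumulative: add the update term to the stored value
            self.values[k] = self.values.get(k, 0.0) + delta[k]


def keys(batch):
    # Application-specific: each example (key, target) names its parameter.
    return sorted({k for k, _target in batch})


def compute_update(batch, w, learning_rate=0.5):
    # Assumed toy objective: move each parameter towards its target value.
    delta = {k: 0.0 for k in w}
    for k, target in batch:
        delta[k] += learning_rate * (target - w[k]) / len(batch)
    return delta


def run_worker(ps, batches, num_epochs):
    for _epoch in range(num_epochs):
        for batch in batches:
            w = ps.pull(keys(batch))            # read only the needed subset
            delta_w = compute_update(batch, w)  # compute the update term
            ps.push(keys(batch), delta_w)       # write the update back


ps = ToyPS()
batches = [[(0, 1.0), (1, 2.0)], [(2, -1.0), (0, 1.0)]]
run_worker(ps, batches, num_epochs=50)
print(ps.values)
```

Note that the worker never touches parameters outside keys(batch), which is what makes the access pattern sparse from the perspective of the PS.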
Algorithm 2.3: Distributed asynchronous SGD with a classic PS.
The program is run by many distributed worker threads in parallel.
Differences to the shared-memory parallel implementation
(Algorithm 2.2) are highlighted in orange.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      t: the ID of this worker thread,
      T: the number of total worker threads
1 for epoch ← 1 to num_epochs do
2     b = num_batches(D, batch_size, t, T)
      // data loading (pipeline parallel with training, in separate thread(s))
3     B = []
4     for i ← 1 to b do
5         B_i = prepare_batch(i, D, batch_size, epoch, t, T)
      // training
6     for i ← 1 to b do
7         w = pull(keys(B_i))
8         Δw = compute_update(B_i, w)
9         push(keys(B_i), Δw)
2.2.2 Static Full Replication
Static full replication parameter management statically replicates all parameters
to all cluster nodes throughout training. The replicas are synchronized
periodically, either synchronously (triggered by the application) or
asynchronously in the background. Figure 2.6 illustrates parameter allocation
in static full replication. Again, parameter allocation is static: it does not
change throughout training.
The advantage of static full replication is that all parameters can be
accessed without synchronous network communication because all nodes
hold a local copy of all model parameters. This is one reason why static full
replication is popular for distributed training of models with dense parameter
access (LeCun et al. 1989; Sergeev and Balso 2018; S. Li et al. 2020). As every
update step accesses all parameters, it is desirable to provide fast access to all
parameters at all times. The static full replication approach is currently often
paired with synchronous parallel SGD. There are PSs that are specialized for
this setting. For example, BytePS (Y. Jiang et al. 2020) assumes a full model
replica on all nodes, and focuses on efficiently synchronizing these replicas
after each update step. The approach of BytePS is to synchronize the replicas
via physically separate, cheaper (“server”) nodes.
[Figure 2.6 (diagram): parameter allocation at node 1, node 2, and node 3 at three points in time (initial, early, late); legend: allocated at node / replicated at node / not available locally.]
Figure 2.6: An example for parameter allocation with static full replication.
Each node holds a local copy of each model parameter (either the main copy of
the parameter or a replica).
For ML tasks with sparse parameter access, however, static full replication
is communication-inefficient. The problem is that it maintains replicas
for all parameters on all nodes throughout training, even though each node
accesses only a small subset of these replicas at each point in time. Thus,
static full replication sends replica updates that are never read. This
over-communication can drastically slow down distributed training. For example,
in our experiments in Section 5.4.4, static full replication on 8 nodes was
significantly slower than efficient single node implementations.
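The over-communication can be made concrete with a toy model of replica synchronization. The sketch below is an assumption-laden simplification: every node holds a full copy, applies its own writes locally, and ships its accumulated update terms to all other nodes at a synchronization point. The function names and the message count (one per key and receiving replica) are illustrative and do not describe any particular system.

```python
def local_write(node, key, delta, replicas, pending):
    replicas[node][key] += delta  # visible locally right away
    pending[node][key] = pending[node].get(key, 0.0) + delta


def synchronize(replicas, pending):
    """Ship every node's accumulated updates to all other replicas."""
    messages = 0
    for sender, updates in enumerate(pending):
        for receiver, replica in enumerate(replicas):
            if receiver == sender:
                continue
            for key, delta in updates.items():
                replica[key] += delta
                messages += 1  # one replica update sent over the network
    for updates in pending:
        updates.clear()
    return messages


num_nodes, num_params = 3, 6
replicas = [[0.0] * num_params for _ in range(num_nodes)]
pending = [{} for _ in range(num_nodes)]
# Sparse access: node i only ever touches parameters 2i and 2i+1.
for node in range(num_nodes):
    local_write(node, 2 * node, 1.0, replicas, pending)
    local_write(node, 2 * node + 1, 0.5, replicas, pending)
messages = synchronize(replicas, pending)
print(messages, replicas[0])
```

In this toy run, all replicas end up identical, but every one of the propagated replica updates targets a parameter that the receiving node never accesses.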
A further disadvantage is that, with static full replication, the maximum
model size does not increase with the number of nodes. On the contrary,
the size of the ML model is limited by the memory capacity of a single
node, as each node holds a replica of the entire model. Thus, static full
replication is infeasible for very large models. Another disadvantage is that
there potentially is staleness among replicas. Common ML systems (S. Li
et al. 2020; Abadi et al. 2016; T. Chen et al. 2015) prevent such staleness by
explicitly synchronizing after every step of synchronous parallel mini-batch
SGD, which can create significant network overhead.
2.2.3 Selective Replication
A replication PS sets up and tears down replicas selectively (Ho et al. 2013;
Yuzhen Huang et al. 2018; J. Jiang et al. 2017; Dai et al. 2015; Cui et al.
2014). I.e., parameters are still allocated statically, as in a classic PS, but the
PS may dynamically replicate subsets of the parameters to additional nodes
to reduce communication overhead (Ho et al. 2013) (see Figure 2.7 for an
example). These subsets differ from node to node and change over time.
There are different protocols for deciding which subset of parameters should
be replicated at which node at which point in time (see below). Replication
PSs provide weaker consistency guarantees than Classic PSs, such as bounded
staleness. For example, Petuum (Ho et al. 2013), a popular replication PS, lets
the application specify a logical clock (one per worker) and guarantees that
local parameter values are not older than an application-specified staleness
bound B with respect to this logical clock. Workers maintain their logical
clock via an advanceClock() operation. Different staleness bounds are
suitable for different ML tasks (Ho et al. 2013). It is the responsibility of the
application to find a suitable staleness bound for the desired ML task.
Algorithm 2.4 depicts an example for how asynchronous mini-batch
SGD can be implemented in Petuum. The program is similar to the one for
a Classic PS, with two additions: (i) the application configures the staleness
bound (line 1) and (ii) each worker advances its clock after each batch (line 11).
There are two main protocols for deciding which subset of the model
parameters should be replicated at which node at which point in time: stale
synchronous parallel (SSP) (Ho et al. 2013) and eager stale synchronous parallel
(ESSP) (Dai et al. 2015).
Algorithm 2.4: Distributed asynchronous SGD with Petuum. Differences
to a Classic PS implementation (Algorithm 2.3) are highlighted in green.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      t: the ID of this worker thread,
      T: the number of total worker threads,
      staleness_bound: staleness bound
1  set_staleness_bound(staleness_bound)
2  for epoch ← 1 to num_epochs do
3      b = num_batches(D, batch_size, t, T)
       // data loading (pipeline parallel with training, in separate thread(s))
4      B = []
5      for i ← 1 to b do
6          B_i = prepare_batch(i, D, batch_size, epoch, t, T)
       // training
7      for i ← 1 to b do
8          w = pull(keys(B_i))
9          Δw = compute_update(B_i, w)
10         push(keys(B_i), Δw)
11         advanceClock()
[Figure 2.7 (diagram): parameter allocation at node 1, node 2, and node 3 at three points in time (initial, early, late); legend: allocated at node / replicated at node / not available locally.]
Figure 2.7: An example for parameter allocation in an SSP replication PS. The
parameters are partitioned to the nodes as in a classic PS, but the PS selectively
replicates subsets of parameters to the different nodes. An SSP replication PS
terminates a replica when an application-specified staleness bound is reached.
Stale synchronous parallel (SSP)
SSP sets up a replica for a parameter p at a node n when node n accesses
parameter p. It continues to use this replica for parameter reads until the
local worker has advanced its clock B (the staleness bound) times since replica
setup. At this point, the PS terminates the replica. It sets up a replica again
when node n accesses parameter p again. Figure 2.7 illustrates an example
for parameter allocation in an SSP replication PS. Replica updates are
accumulated locally and propagated to the node that holds the main copy of the
parameter at the subsequent advanceClock invocation.
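The replica lifecycle described above can be sketched for a single node with a single worker. This is a toy model, not Petuum's implementation: the class SspNode and its attributes are hypothetical, the main copies are represented by a plain dictionary instead of remote nodes, and applying local writes to the local replica is an assumption of this sketch.

```python
class SspNode:
    """Toy model of SSP replicas at one node (hypothetical, single worker)."""

    def __init__(self, main_copy, staleness_bound):
        self.main_copy = main_copy   # stands in for the nodes holding main copies
        self.bound = staleness_bound
        self.clock = 0
        self.replica = {}            # key -> locally replicated value
        self.setup_clock = {}        # key -> clock at replica setup
        self.pending = {}            # key -> locally accumulated updates
        self.replica_setups = 0      # each setup is a synchronous remote read

    def pull(self, key):
        if key not in self.replica:  # reactive replica setup on access
            self.replica[key] = self.main_copy[key]
            self.setup_clock[key] = self.clock
            self.replica_setups += 1
        return self.replica[key]

    def push(self, key, delta):
        self.pending[key] = self.pending.get(key, 0.0) + delta
        if key in self.replica:      # assumption: local writes are visible locally
            self.replica[key] += delta

    def advance_clock(self):
        for key, delta in self.pending.items():
            self.main_copy[key] += delta  # propagate accumulated updates
        self.pending.clear()
        self.clock += 1
        expired = [k for k, c in self.setup_clock.items()
                   if self.clock - c >= self.bound]
        for key in expired:          # terminate replicas after B clock advances
            del self.replica[key]
            del self.setup_clock[key]


main = {0: 1.0, 1: 2.0}
node = SspNode(main, staleness_bound=2)
first = node.pull(0)      # sets up the replica
node.push(0, 0.5)
node.advance_clock()
second = node.pull(0)     # served from the replica
node.advance_clock()      # the replica reaches the staleness bound
third = node.pull(0)      # replica has to be set up again
print(first, second, third, node.replica_setups)
```

The counter replica_setups makes the cost of the reactive protocol visible: every increment corresponds to a read for which the worker would have to wait for the network.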
The main advantage of SSP is that only a small number of replicas exists at
each point in time.³ Only this small number of replicas has to be maintained.
The main disadvantage of SSP is that replicas are set up reactively: the PS
sets up the replica when the worker reads the parameter. I.e., the worker has
to wait synchronously for the replica to be set up (Dai et al. 2015). This can
severely limit the performance of SSP. For example, in our experiments in
Section 4.4.2, SSP on 8 nodes was more than 5x slower than an efficient single
node implementation.
³ Assuming a realistic setting for the staleness bound. With excessively large staleness
bounds, an excessively large number of replicas will be maintained at each point in time.
[Figure 2.8 (diagram): parameter allocation at node 1, node 2, and node 3 at three points in time (initial, early, late); legend: allocated at node / replicated at node / not available locally.]
Figure 2.8: An example for parameter allocation in an ESSP replication PS. In
contrast to SSP (Figure 2.7), ESSP does not terminate replicas. Once a replica
is set up, ESSP maintains the replica throughout training (by continuously
propagating updates). Thus, the number of replicas typically increases until
each node holds replicas of (almost) all parameters.
Eager stale synchronous parallel (ESSP)
ESSP also creates a replica when a parameter is (first) accessed, but then
maintains this replica throughout the entire training task. I.e., after a node n
accesses a parameter p once, the PS will continuously propagate all updates
for parameter p to node n. Figure 2.8 illustrates an example for parameter
allocation in an ESSP replication PS. Petuum ESSP guarantees that a read
parameter value is not older than the configured staleness threshold. Behind
the scenes, however, ESSP aims to send updates as soon as possible, so that
the typical actual staleness (ideally) is lower than the configured staleness
bound (Dai et al. 2015). As in SSP, replica updates are accumulated locally
and propagated to the node that holds the main copy of the parameter at the
subsequent advanceClock invocation.
The main advantage of ESSP over SSP is that, after the initial creation
of a replica for parameter p upon first access by node n, no further waiting
for synchronous replica setup is necessary for parameter p at node n, as the
replica is maintained throughout training. However, this eager replica setup
comes at the cost of over-communication: for common ML tasks, the ESSP
protocol becomes similar to static full replication after a short period of time,
i.e., it maintains replicas for (almost) all parameters on all nodes. The reason
for this is that, in common ML tasks, each node accesses almost all parameters
over time. As for static full replication, this can make distributed implementations
inefficient. For example, in our experiments in Section 4.4.2, ESSP on 8
nodes was more than 4x slower than an efficient single node implementation.
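The difference in the number of maintained replicas can be illustrated with a small simulation of one node. The access pattern is an assumption: in every clock, the node accesses a fixed number of parameters drawn uniformly at random, which is a simplification of real (typically skewed) workloads. The function simulate and its parameters are hypothetical.

```python
import random


def simulate(num_params, keys_per_clock, num_clocks, staleness_bound, seed=0):
    """Count replicas held by one node at the end of a run (toy model)."""
    rng = random.Random(seed)
    ssp_setup_clock = {}   # SSP: key -> clock at which the replica was set up
    essp_replicas = set()  # ESSP: replicas are never terminated
    for clock in range(num_clocks):
        # SSP terminates replicas after `staleness_bound` clock advances.
        ssp_setup_clock = {k: c for k, c in ssp_setup_clock.items()
                           if clock - c < staleness_bound}
        for key in rng.sample(range(num_params), keys_per_clock):
            ssp_setup_clock.setdefault(key, clock)  # set up replica if missing
            essp_replicas.add(key)
    return len(ssp_setup_clock), len(essp_replicas)


ssp_count, essp_count = simulate(num_params=1000, keys_per_clock=10,
                                 num_clocks=2000, staleness_bound=2)
print(ssp_count, essp_count)
```

Under these assumptions, SSP never holds more than keys_per_clock times staleness_bound replicas, whereas the ESSP replica set keeps growing until it covers (almost) all parameters.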
2.3 Related Work
We discussed the most closely related work (on PSs) in the previous section. In
the following, we discuss other related work on efficient distributed ML.
2.3.1 Update Compression
Several forms of (lossy) compression have been proposed to reduce the size
of the updates that need to be exchanged over the network. One direction is
lower-precision updates, i.e., to reduce the number of bits for each element of
the update (Seide et al. 2014; Wen et al. 2017; Alistarh et al. 2017; H. Zhang
et al. 2017; Hubara et al. 2017; Bernstein et al. 2018). Another direction is
sparsification, i.e., to selectively communicate only some elements of the
update (Mu Li et al. 2014b; Aji and Heafield 2017; Lin et al. 2018; H. Wang
et al. 2018). Some of these techniques accumulate un-synchronized updates
locally and transfer them at a later point in time, such that most (or all)
updates are eventually synchronized. Recent research has called into question
how beneficial compression techniques are in current distributed ML settings,
as encoding and decoding take time and sparsification in particular prevents
optimizations for dense communication (Agarwal et al. 2022). Nevertheless,
in general, work on update compression is orthogonal to the work in this
thesis, and integrating these directions into PSs might make for interesting
directions for future work.
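As an illustration of sparsification with local accumulation, the following Python sketch sends only the k largest-magnitude elements of an update and keeps the remainder as a local residual for later steps. Top-k selection is used here as one common example of such a scheme; it is not claimed to be the exact method of any of the cited works, and the function name and signature are hypothetical.

```python
def sparsify_top_k(update, residual, k):
    """Send the k largest-magnitude entries; keep the rest locally."""
    accumulated = [u + r for u, r in zip(update, residual)]
    largest = sorted(range(len(accumulated)),
                     key=lambda i: abs(accumulated[i]), reverse=True)[:k]
    message = {i: accumulated[i] for i in largest}  # sent over the network
    new_residual = [0.0 if i in message else a      # carried over to later steps
                    for i, a in enumerate(accumulated)]
    return message, new_residual


residual = [0.0] * 5
message1, residual = sparsify_top_k([0.1, -2.0, 0.3, 0.05, 1.5], residual, k=2)
message2, residual = sparsify_top_k([0.1, 0.0, 0.3, 0.05, 0.0], residual, k=2)
print(message1, message2, residual)
```

In the second step, elements 0 and 2 are transferred with their accumulated values, which shows how previously unsent parts of an update are eventually synchronized.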
2.3.2 Higher Compute Intensity
One way to reduce the impact of communication in distributed ML is to
increase the computation-to-communication ratio of the update steps. This
is commonly done by increasing the batch size of mini-batch SGD (Goyal
et al. 2017; You et al. 2017b). With a larger batch size, more computation
is necessary in one update step as the model is evaluated for each training
example. At the same time, for dense ML tasks, the amount of communi-
cation remains constant as the gradients for the different training examples
are summed up before communication. However, depending on model and
data complexity, increasing the batch size potentially impairs model quality
(specifically, generalization error) (Keskar et al. 2017; Masters and Luschi
2018; Shallue et al. 2019), giving rise to a field of research on maintaining
high model quality with large batch sizes (You et al. 2017a; You et al. 2020;
Huo et al. 2021). Unfortunately, increasing batch size has only limited ef-
fect on compute intensity for ML tasks with sparse access—the focus of this
thesis—because the amount of communication for an update step increases
with increasing batch size for sparse ML tasks. The reason for this is that an
update step in a sparse ML task communicates gradients only for the affected
parameters (typically a tiny subset of all parameters). Each training example
typically affects different parameters, such that an update step with more
training examples affects more parameters. Thus, the larger the batch size,
the more communication is required for this batch.
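This effect can be illustrated with a small simulation that counts the distinct parameters touched by one batch. The setup is an assumption for illustration: each training example accesses five parameters drawn uniformly at random from 100,000, whereas real access patterns are typically skewed. The helper keys_communicated is hypothetical.

```python
import random


def keys_communicated(batch_size, num_params, keys_per_example, rng):
    """Number of distinct parameters that one sparse batch touches."""
    touched = set()
    for _ in range(batch_size):
        touched.update(rng.sample(range(num_params), keys_per_example))
    return len(touched)


rng = random.Random(0)
num_params = 100_000
batch_sizes = (1, 10, 100, 1000)
sparse = {b: keys_communicated(b, num_params, 5, rng) for b in batch_sizes}
# Dense task: every update step communicates all parameters, for any batch size.
dense = {b: num_params for b in batch_sizes}
print(sparse)
print(dense)
```

For the dense task, computation per update step grows with the batch size while communication stays constant; for the sparse task, both grow, so the computation-to-communication ratio improves much less.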
2.3.3 Deep Learning
Deep learning has become a highly popular subfield of ML, with a large
body of research focusing on efficiently training multi-layer networks with
synchronous SGD, with mostly dense access into large parameter tensors.
The key challenge in this field is to efficiently exchange (entire) large tensors
among nodes. One direction in this field is the development of efficient
all-reduce methods (Sergeev and Balso 2018; Awan et al. 2017; S. Wang et al.
2019), e.g., by exchanging tensors in ring topologies, and alternative dense
communication methods (Y. Jiang et al. 2020). Another direction is to overlap
communication with computation of different layers of the model (Yanping
Huang et al. 2019; Narayanan et al. 2019; Fan et al. 2021; Bowen Yang et al.
2021), e.g., to start exchanging the gradients for higher layers already while
the backward pass of lower layers is still running (Yanping Huang et al. 2019).
These research directions are not directly applicable to sparse ML tasks be-
cause all-reduce is not efficient for synchronizing sparsely updated tensors and
there typically is no or only little potential for overlapping communication
and computation synchronously within one batch in sparse ML models.
2.3.4 Other Task-Specific Approaches
There is a range of approaches to increase efficiency of distributed parameter
management for specific classes of ML tasks other than deep learning. These
approaches, too, are tailored to the respective class of tasks, and are not directly
applicable to other tasks. For example, there are many approaches specifically
for distributed training of large deep learning recommender systems (W.
Zhao et al. 2020; Adnan et al. 2021; Yuzhen Huang et al. 2021; Miao et al.
2022). And there are many approaches specifically for distributed training of
graph neural networks (R. Zhu et al. 2019; D. Zhang et al. 2020; Min et al.
2021; D. Zheng et al. 2020a; J. Peng et al. 2022). In contrast, we focus on
general-purpose distributed parameter management in this thesis.
2.3.5 Faster Interconnect Hardware
Another direction to reduce the impact of communication is to develop and
use faster interconnect hardware. Fast network interconnects, such as InfiniBand,
enable data exchange on the order of tens of gigabytes per second between nodes
(e.g., 10.2 GB per second in an InfiniBand EDR 4x cluster
(Ziegler et al. 2022)). Faster interconnects help to reduce the impact of com-
munication overhead in distributed ML. However, existing interconnects do
not eliminate the communication overhead bottleneck. For example, we run
our experiments in Sections 4.4 and 5.4 on a cluster with a modern 100 GBit
InfiniBand network. Nevertheless, existing approaches to parameter management
(e.g., full replication, a classic PS, and existing replication PSs) are
inefficient. In particular, modern interconnects do not eliminate the latency
hierarchy, i.e., that, for example, access to local DRAM is faster (in the or-
der of 100 nanoseconds) than access to remote memory (even with RDMA
over InfiniBand, at least a few microseconds (Ziegler et al. 2020; Kalia et al.
2016)). This hierarchy is unlikely to disappear any time soon: latency tends
to improve more slowly than bandwidth with new interconnects and is inherently
limited by the speed of light (Patterson 2004).
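A back-of-the-envelope calculation makes the latency hierarchy concrete. The sketch below uses the order of magnitude stated above for local DRAM (100 nanoseconds) and assumes 2 microseconds for a remote access, as one possible reading of "a few microseconds"; the number of accesses is hypothetical.

```python
LOCAL_DRAM_LATENCY_S = 100e-9   # order of magnitude stated in the text
REMOTE_RDMA_LATENCY_S = 2e-6    # assumption: "a few microseconds"


def total_access_time(num_accesses, latency_s):
    """Time spent waiting if accesses are issued one after another."""
    return num_accesses * latency_s


n = 1_000_000  # hypothetical number of sequential parameter accesses
local = total_access_time(n, LOCAL_DRAM_LATENCY_S)
remote = total_access_time(n, REMOTE_RDMA_LATENCY_S)
print(f"local: {local:.2f} s, remote: {remote:.2f} s, "
      f"ratio: {remote / local:.0f}x")
```

The calculation ignores batching and pipelining of requests, which reduce the gap in practice but do not remove it.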
Communication links that connect the GPUs of one node directly, such
as NVLink, allow GPU-to-GPU transfers (within one node) in the order of
100s of gigabytes per second (A. Li et al. 2020). An interesting line of research
aims to build PSs for the resulting heterogeneous communication topology
(Y. Jiang et al. 2020; Miao et al. 2021; G. Wang et al. 2020; J. Jiang et al. 2017).
It is a particularly interesting area for future work to apply the ideas of this
thesis to such heterogeneous topologies (see also our discussion of future
work in Chapter 6).
2.3.6 Decentralized Training
A further direction for reducing communication overhead is decentralized
training approaches (Lian et al. 2017; Tang et al. 2018; Assran et al. 2019). The
key idea is that the individual nodes of the cluster learn independently, and
exchange (some of) their learning progress (i.e., the model parameters) only
from time to time, and potentially only to some (and not all) other nodes.
The main difference between decentralized approaches and the distributed parameter
management approaches discussed above is that there is no (“central”) point
of truth for a parameter that is guaranteed to eventually receive all updates
for this parameter. Often, these approaches draw ideas from gossip com-
munication algorithms (Demers et al. 1987). They introduce novel training
algorithms that differ from the popular and (comparably) well-understood
SGD training. In particular, these training algorithms have different conver-
gence properties and guarantees (Lian et al. 2017). In contrast, we focus on
SGD training in this thesis.
2.3.7 Programming Abstractions and Scheduling
There are many systems that introduce programming abstractions and/or
ways to schedule computation for ML (often as parts of larger system stacks).
Some of these are purpose-built for ML, such as PyTorch (Paszke et al. 2019;
S. Li et al. 2020), TensorFlow (Abadi et al. 2016), MXNet (T. Chen et al. 2015),
and SystemDS (Ghoting et al. 2011; Boehm et al. 2016; Boehm et al. 2020).
Others are more general, but also applicable to ML tasks, such as data flow
systems (Zaharia et al. 2010; Carbone et al. 2015) and graph-based systems
(Low et al. 2012; Malewicz et al. 2010). Distributed parameter management
is one component in the stack of these systems. In contrast to these broader
systems, we specifically focus on this single component in this thesis. That is, we
treat the sequence of parameter access operations as set by the application and
try to make PSs more efficient for common sequences. Such improvements to
the PS component can potentially be leveraged by programming abstractions
and scheduling approaches.
2.3.8 Key–Value Stores
PSs are key–value stores that are specialized for storing parameters of ML
models. In fact, early work on PSs (Smola and Narayanamurthy 2010) used
memcached,⁴ an off-the-shelf general-purpose key–value store, to manage
parameters. Since then, several different types of PSs, i.e., systems that are
built for the specific purpose of storing ML parameters, have been devel-
oped (Mu Li et al. 2014a; Ho et al. 2013; J. Jiang et al. 2017; Yuzhen Huang
et al. 2018; Jagerman et al. 2017; Z. Zhang et al. 2019; Y. Jiang et al. 2020). In
contrast to general-purpose key–value stores, PSs (1) are designed for storing
multi-dimensional arrays, such as vectors and tensors (Mu Li et al. 2014b),
(2) are often co-located on the same nodes as the application processes to
enable low-latency access (because ML applications access the PS frequently)
(Jagerman et al. 2017; J. Jiang et al. 2017; Ho et al. 2013; Yuzhen Huang et al.
2018), (3) often work with relaxed consistency schemes, such as bounded
staleness (Ho et al. 2013; Dai et al. 2015) or fully asynchronous updates (Dean
et al. 2012), and (4) integrate ML-specific functionality directly into the PS,
such as update filtering or compression (Mu Li et al. 2014b).
⁴ https://memcached.org/
Chapter 3
Exploiting Locality:
Dynamic Parameter Allocation
A set of implementations of distributed ML training reduce communication
overhead by employing specific techniques to create and/or exploit parameter
access locality (PAL). PAL refers to the tendency that different nodes access
different subsets of parameters at specific points in time. There exist several
different such PAL techniques (Gemulla et al. 2011; Yun et al. 2014; Teflioudi
et al. 2012; Beutel et al. 2014; Lerer et al. 2019; Raman et al. 2019; Yu et al.
2015; B. Peng et al. 2017; Low et al. 2012; Gonzalez et al. 2012; Nakandala
et al. 2019). Some techniques explicitly create locality, e.g., by clustering
training data according to the parameters that training examples access, or
by restricting nodes to specific parameter subsets at specific points in time.
Another approach is to exploit locality that is inherent in common ML tasks.
Existing PSs provide only limited support for PAL techniques: some
techniques are tedious to implement in PSs (because they require knowl-
edge of PS internals and careful key design), others are entirely impossible
to implement in existing PSs (because existing PSs allocate parameters statically).
Due to this limited support, PAL techniques have to be implemented manually,
outside PSs, using low-level distributed programming primitives
(Gemulla et al. 2011; Yun et al. 2014; Teflioudi et al. 2012; Lerer et
al. 2019). This low-level programming requirement makes PAL techniques
complex to implement and, consequently, hinders their adoption.
In this chapter, we explore whether and to what extent PSs can exploit
locality, and whether that is beneficial. First, we discuss common PAL tech-
niques and analyze what is necessary for PSs to support them (Section 3.1).
Based on this analysis, we propose dynamic parameter allocation (DPA), i.e.,
that the PS adapts its parameter allocation to the ML task. DPA allows the
PS to dynamically relocate parameters to where they are currently accessed,
while providing location transparency and PS consistency guarantees, i.e.,
sequential consistency. We discuss design options for PSs with DPA and
describe a prototype implementation of such a PS called Lapse (Section 3.2).
Our experimental evaluation shows (i) that existing PSs—without DPA—
barely outperform efficient single node baselines and (ii) that DPA drastically
improves efficiency for some ML tasks (Section 3.3).
3.1 The Case for Dynamic Parameter Allocation
We outline common PAL techniques used in distributed ML (Section 3.1.1).
For each technique, we discuss to what extent it is supported in existing PSs
and identify what features are required to enable or improve support. Finally,
we propose DPA, which enables PSs to exploit PAL techniques directly
(Section 3.1.2).
3.1.1 PAL Techniques
We consider three common PAL techniques, i.e., techniques that applications
employ to create and/or exploit access locality: data clustering, parameter
blocking, and latency hiding.
Data Clustering
One method to reduce communication cost is to exploit structure in training
data (Low et al. 2012; Gonzalez et al. 2012; R. Chen et al. 2015; Smola and
Narayanamurthy 2010; Ahmed et al. 2012). For example, consider a training
dataset that consists of documents written in two different languages and an
ML model that associates a parameter with each word (e.g., a bag-of-words
classifier or word embeddings). When processing a document during training,
only the parameters for the words contained in the document are relevant.
This property can be exploited using data clustering. For example, if a separate
worker is used for the documents of each language, different workers access
mostly separate parameters. This is an example of PAL: different workers
access different subsets of the parameters at a given time. This locality can be
exploited by allocating parameters to the worker machines that access them.
Figure 3.1 depicts an example; here, rows correspond to documents, dots to
words, and each parameter is allocated to the node that accesses the parameter
most frequently.
Data clustering can be exploited in existing PSs in principle, although
it is often painful to do so because PSs provide no direct control over the
allocation of the parameters. Instead, parameters are typically partitioned
[Figure 3.1: data (rows, at worker 1 and worker 2) against parameters (columns) under a fixed allocation; legend: parameter allocation (allocated at worker 1, allocated at worker 2) and data focus (worker 1, worker 2).]
Figure 3.1: The data clustering PAL technique: the training data are clustered
such that each worker accesses mostly a separate subset of parameters. Rows
correspond to data points, columns to parameters, and black dots to parame-
ter access.
using either hash or range partitioning. To exploit data clustering, appli-
cations may manually enforce the desired allocation by key design, i.e., by
explicitly assigning keys to parameters such that the parameters are allocated
to the desired node. Such an approach requires knowledge of PS internals,
preprocessing of the training data, and a custom implementation for each
task. To improve support for data clustering, PSs should provide support for
explicit parameter location control.
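The key design workaround can be sketched as follows. The example assumes a PS that range-partitions keys evenly across servers; the function names and the two-language vocabulary are hypothetical and serve only to show why the application has to know the partitioning scheme in order to steer allocation.

```python
def range_partition(key, num_keys, num_servers):
    """Server that holds `key` under range partitioning (illustrative)."""
    keys_per_server = -(-num_keys // num_servers)  # ceiling division
    return key // keys_per_server


def assign_keys_by_cluster(words_per_node):
    """Key design: give each node's words one contiguous key range.

    `words_per_node[i]` lists the words that node i accesses most.
    Assumes equally sized word lists so that ranges align with servers.
    """
    key_of = {}
    next_key = 0
    for words in words_per_node:
        for word in words:
            key_of[word] = next_key
            next_key += 1
    return key_of


# Hypothetical two-language vocabulary, one language per node.
words_per_node = [["the", "house", "dog"], ["der", "haus", "hund"]]
key_of = assign_keys_by_cluster(words_per_node)
for node, words in enumerate(words_per_node):
    for word in words:
        server = range_partition(key_of[word], num_keys=6, num_servers=2)
        print(word, "-> key", key_of[word], "-> server", server)
```

The mapping only works because the application mirrors the partitioning function of the PS, which is the kind of coupling to PS internals described above.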
To exploit data clustering, it is essential that the PS provides fast access
to local parameters, e.g., by using shared memory as in manual implementations
(Gemulla et al. 2011; Yun et al. 2014). However, to the best of our
knowledge, all existing PSs access parameters either through inter-process (Mu
Li et al. 2014a; Jagerman et al. 2017; J. Jiang et al. 2017) or inter-thread
communication (Xing et al. 2015; Yuzhen Huang et al. 2018), leading to overly
high access latency. We will see in the experiments in Section 3.3.2 that such
inter-process and inter-thread communication are not sufficient to provide
fast parameter access.
[Figure 3.2: data (rows, at worker 1 and worker 2) against parameters (columns), shown once for the allocation in subepoch 1 and once for the allocation in subepoch 2; legend: parameter allocation (allocated at worker 1, allocated at worker 2) and data focus (worker 1, worker 2).]
Figure 3.2: The parameter blocking PAL technique: within each subepoch,
each worker is restricted to one block of parameters; which worker has access
to which block changes from subepoch to subepoch. Rows correspond to data
points, columns to parameters, black dots to parameter access in the current
subepoch, and gray dots to parameter access in another subepoch.
Parameter Blocking
An alternative approach to provide PAL is to divide the model parameters
into blocks. Training is split into subepochs such that each worker is restricted
to one block of parameters within each subepoch. Which worker has access to
which block changes from subepoch to subepoch, however. Such parameter
blocking approaches have been developed for many ML algorithms, including
matrix factorization (Gemulla et al. 2011; Yun et al. 2014; Teflioudi et al.
2012), tensor factorization (Beutel et al. 2014), latent Dirichlet allocation (Yu
et al. 2015; B. Peng et al. 2017), multinomial logistic regression (Raman et al.
2019), and knowledge graph embeddings (Lerer et al. 2019). They were also
proposed for efficient multi-model training (Nakandala et al. 2019).
Manual implementations exploit parameter blocking by allocating pa-
rameters to the node where they are currently accessed (Gemulla et al. 2011;
Teflioudi et al. 2012; Yun et al. 2014), thus eliminating network communi-
cation for individual parameter accesses. Communication is required only
between subepochs. Figure 3.2 depicts a simplified example for a matrix
factorization task (Gemulla et al. 2011). In the first subepoch, worker 1
(worker 2) focuses on the left (right) block of the model parameters. It does
so by only processing the corresponding part of its data and ignoring the
remainder. In the second subepoch, each worker processes the other block
and the other part of its data. This process is repeated multiple times.
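The alternation of blocks across subepochs can be written down as a small schedule. The sketch below uses a simple rotation, which is one possible schedule and an assumption of this example; the cited approaches define their own schedules.

```python
def block_schedule(num_workers, num_subepochs):
    """Block that each worker may access in each subepoch.

    Uses a simple rotation (an assumption of this sketch): in
    subepoch s, worker w works on block (w + s) mod num_workers,
    so no block is assigned to two workers at the same time.
    """
    return [[(w + s) % num_workers for w in range(num_workers)]
            for s in range(num_subepochs)]


schedule = block_schedule(num_workers=2, num_subepochs=2)
for s, blocks in enumerate(schedule):
    for w, block in enumerate(blocks):
        print(f"subepoch {s + 1}: worker {w + 1} -> block {block + 1}")
```

With two workers, this reproduces the situation in the example: the workers swap blocks between the first and the second subepoch.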
Existing PSs offer limited support for parameter blocking because they
allocate parameters statically. This means that a parameter is assigned to one
server and stays there throughout training. It is therefore not possible to
dynamically allocate a parameter to the node where it is currently accessed.
Parameter blocking can be emulated to some extent in replication PS archi-
tectures, however. This requires the creation of replicas for each block and
forced refreshes of replicas between subepochs. Such an approach is limited
to synchronous parameter blocking approaches, requires significant changes
to the implementation, and induces unnecessary communication (because
parameters are transferred via their server instead of directly from worker
to worker). To exploit parameter blocking efficiently, PSs need to support
parameter relocation, i.e., the ability to move model parameters among
nodes during run time.
Latency Hiding
Latency hiding techniques reduce communication overhead (but not commu-
nication itself). For example, prefetching is commonly used when there is
a distinction between local and remote data, such as in processor caches (A.
Smith 1982) or distributed systems (Steen and Tanenbaum 2017). In dis-
tributed ML, the latency of parameter access can be reduced by ensuring that
a parameter value is already present at a worker when it is accessed (Dai et al.
2015; Cui et al. 2014; Teflioudi et al. 2012). Such an approach is beneficial
when parameter access is sparse, i.e., each worker accesses few parameters
at a time. Note that latency hiding does not require the ML algorithm to
explicitly create locality (as data clustering and parameter blocking do). Thus,
latency hiding is both easier to apply and more widely applicable to ML tasks
than data clustering and parameter blocking.
Prefetching can be implemented by pulling a parameter asynchronously
before it is needed. The disadvantage of this approach is that an applica-
tion needs to manage prefetched parameters, and that updates that occur
between prefetching a parameter and using it are not visible. Therefore, such
an approach provides neither sequential nor causal consistency (we discuss
and analyze consistency guarantees of PSs in Section 3.2.4). Moreover, the
exchange of parameters between different workers always involves the server,
which may be inefficient. An alternative approach is the ESSP consistency
protocol of Petuum (Dai et al. 2015), which proactively replicates all previ-
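[Figure 3.3: data (rows, at worker 1 and worker 2) against parameters (columns) under dynamic allocation, with the current and the prefetched region marked for each worker; legend: parameter allocation (allocated at worker 1, allocated at worker 2) and data focus (worker 1, worker 2).]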
Figure 3.3: The latency hiding PAL technique: asynchronously prefetching (or
prelocalizing) of parameter values, such that they can be accessed locally, hides
access latency. Rows correspond to data points, columns to parameters, and
black dots to parameter access.
ously accessed parameters at a node. This approach avoids manual parameter
management, but does not provide sequential consistency. ESSP also causes
over-communication, as typically, after a short warm-up time, nodes will
hold more replicas than necessary (we discuss this effect of ESSP in more
detail in Section 4.2).
An alternative to prefetching is to prelocalize a parameter before access,
i.e., to reallocate the parameter from its current node to the node where it is
accessed and to keep it there afterward (until it is prelocalized by some other
worker). This approach is illustrated in Figure 3.3. Note that, in contrast to
prefetching, the parameter is not replicated. Consequently, parameter updates
by other workers are immediately visible after prelocalization. Moreover,
there is no need to write local updates back to a remote location as the
parameter is now stored locally. To support prelocalization, PSs need to
support parameter relocation with consistent access before, during, and
after relocation. For example, no updates should be lost if a parameter is
accessed during relocation. Ideally, a PS with relocation would maintain the
strong consistency guarantees of classic PSs.
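The difference between prefetching and prelocalization can be made concrete with a toy example. The class below is a single-process stand-in for a PS that holds one parameter; it is a deliberately simplified illustration (all names are hypothetical) and does not model networking or concurrency.

```python
class ToyStore:
    """Single-process stand-in for a PS holding one parameter."""

    def __init__(self, value):
        self.owner = "node_a"   # node that currently stores the value
        self.value = value

    def push(self, update):
        # Cumulative update, always applied at the current owner.
        self.value += update

    def pull(self):
        return self.value


store = ToyStore(1.0)

# Prefetching: node_b copies the value ahead of time.
prefetched_copy = store.pull()
store.push(0.5)                        # concurrent update by another worker
print("prefetched:", prefetched_copy)  # the copy misses this update

# Prelocalization: the parameter itself moves to node_b; no copy exists.
store.owner = "node_b"
store.push(0.5)                        # routed to the new owner
print("prelocalized:", store.pull())   # reflects all updates so far
```

The prefetched copy is a snapshot that the application has to manage itself, whereas after prelocalization there is still exactly one value, which now happens to reside at the accessing node.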
3.1.2 Dynamic Parameter Allocation
As discussed above, existing PSs offer limited support for the PAL techniques
of distributed ML. The main obstacles are that existing PSs provide limited
control over parameter allocation and perform allocation statically. In more
detail, we identified the following requirements to enable or improve support:
Fast local access. PSs should provide low-latency access to local parameters.
Parameter location control. PSs should allow applications to control where a parameter is stored.
Parameter relocation. PSs should support relocating parameters between servers during run time.
Consistent access. Parameter access should be consistent before, during, and after a relocation.
To satisfy these requirements, the PS must support DPA, i.e., it must be
able to change the allocation of parameters during run time. While doing
so, the PS semantics must not change: pull and push operations need to
be oblivious of a parameter’s current location and provide correct results
whether or not the parameter is currently being relocated. This requires the
PS to manage parameter locations, to transparently route parameter accesses
to the parameter’s current location, to handle reads and writes correctly
during relocations, and to provide to applications new primitives to initiate
parameter relocations.
A DPA PS enables support for PAL techniques roughly as follows: each
worker instructs the PS to localize the parameters that it will access frequently
in the near future, but otherwise uses the PS as it would use any other PS, i.e.,
via the pull and push primitives. For data clustering, applications control
parameter locations once in the beginning: each node localizes the parameters
that it accesses more frequently than the other nodes. Subsequently, the
majority of parameter accesses (using pull and push) is local. For parameter
blocking, at the beginning of each subepoch, applications move parameters
of a block to the node that accesses them during the subepoch. Parameter
accesses (both reads and writes) within the subepoch then require no further
network communication. Finally, for latency hiding, workers prelocalize
parameters before accessing them. When the parameter is accessed, latency is
low because the parameter is already local (unless another worker localized
the parameter in the meantime). Concurrent updates by other workers are
seen locally, because the PS routes them to the parameter’s current location.
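The interplay of location transparency and relocation can be sketched with a toy, in-process PS. The class below is an illustration under strong simplifications (no threads, no network, a hypothetical modulo-based initial allocation) and is not the design of any existing system; it counts the accesses that would require a network message.

```python
class ToyDpaPs:
    """In-process sketch of a PS with dynamic parameter allocation."""

    def __init__(self, num_nodes, keys):
        # One local store per node; initial allocation by key modulo.
        self.stores = [dict() for _ in range(num_nodes)]
        self.location = {}
        for key in keys:
            node = key % num_nodes
            self.stores[node][key] = 0.0
            self.location[key] = node
        self.remote_accesses = 0

    def _route(self, node, key):
        owner = self.location[key]
        if owner != node:
            self.remote_accesses += 1  # would be a network message
        return self.stores[owner]

    def pull(self, node, key):
        return self._route(node, key)[key]

    def push(self, node, key, update):
        self._route(node, key)[key] += update  # cumulative push

    def localize(self, node, key):
        owner = self.location[key]
        if owner != node:
            self.stores[node][key] = self.stores[owner].pop(key)
            self.location[key] = node


ps = ToyDpaPs(num_nodes=2, keys=range(4))
ps.push(node=0, key=1, update=1.0)   # key 1 starts at node 1: remote
ps.localize(node=0, key=1)           # move key 1 to node 0
ps.push(node=0, key=1, update=1.0)   # now local
ps.push(node=1, key=1, update=1.0)   # other node is routed transparently
print(ps.pull(node=0, key=1), ps.remote_accesses)
```

Callers use pull and push identically before and after the localize call; only the number of remote accesses changes.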
[Figure 3.4: three processes (at node 1, node 2, and node 3), each containing the node's parameters, one server thread, and worker threads 1 to 3.]
Figure 3.4: PS architecture with server and worker threads co-located in one
process per node.
3.2 The Lapse Parameter Server
To explore the suitability of PSs with DPA as well as architectural design
choices, we created Lapse. Lapse is based on PS-Lite (Mu Li et al. 2014a) and
aims to fulfill the requirements established in the previous sections. In partic-
ular, Lapse provides fast access to local parameters, consistency guarantees
similar to classic PSs, and efficient parameter relocation. We start with a brief
overview of Lapse and subsequently discuss individual components, including
parameter relocation, parameter access, consistency, location management,
granularity, and important implementation aspects.
3.2.1 Overview
Lapse co-locates worker and server threads in the same process, as illustrated
in Figure 3.4, because this architecture facilitates low-latency local parameter
access (see below).
API
Lapse adds a single primitive called localize to the API of the PS; see
Table 3.1. The primitive takes the keys of one or more parameters as arguments.
When a worker issues a localize, it requests that all provided parameters are
relocated to its node. Lapse then transparently relocates these parameters and
future accesses by the worker require no further network communication.
Algorithm 3.1 depicts one example of how the primitive can be used, for
implementing the asynchronous distributed SGD example from Section 2.2.
In this example, the data loader initiates the parameter relocation (line 6)
when the batch is prepared. Batch preparation runs pipeline parallel with
the training (see Figure 2.1 on page 9). Thus, the relocation for a batch is
Table 3.1: Primitives of Lapse, a PS with dynamic parameter allocation. The
push primitive is cumulative. All primitives can run synchronously or
asynchronously. Compared to classic PSs, Lapse adds one primitive to initiate
parameter relocations.

Primitive                   Sync.   Async.   Description
pull(parameters)            yes     yes      Read the values of parameters from the corresponding servers.
push(parameters, updates)   yes     yes      Send updates for parameters to the corresponding servers.
localize(parameters)        yes     yes      Request local allocation of parameters.
triggered shortly before the training on this batch starts, so that (ideally)
relocation finishes before the worker thread reads and writes the parameters
(lines 8 and 10, respectively).
We opted for the localize primitive (instead of a more general primitive
that allows for relocation among arbitrary nodes) because it is simpler
and sufficiently expressive to support PAL techniques. Furthermore,
localize preserves the PS property that two workers logically interact only
via the servers (and not directly) (Mu Li et al. 2014a). Workers access localized
parameters in the same way as non-localized parameters. This allows Lapse
to relocate parameters without affecting workers that use them.
Location Management
Lapse manages parameter locations with a decentralized home node approach
(Steen and Tanenbaum 2017): for each parameter, there is one owner node
that stores the current parameter value and one home node that knows the
parameter’s current location. The home node is assigned statically as in
existing PSs, whereas the owner node changes dynamically during run time.
We further discuss location management in Section 3.2.5.
Parameter Access
Lapse ensures that local parameter access is fast by accessing local parameters
via shared memory. For non-local parameter access, Lapse sends a message to
the home node, which then forwards the message to the current owner of a
parameter. Lapse optionally supports location caches, which eliminate the
message to the home node if a parameter is accessed repeatedly while it is not
relocated. See Section 3.2.3 for details.
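The routing of a request can be summarized by counting the messages that are needed until the request reaches the owner. The function below is a simplified model and not the Lapse implementation: it excludes the reply, ignores stale cache entries, and assumes that a requester that is itself the home node contacts the owner directly.

```python
def request_hops(requester, home, owner, cached_owner=None):
    """Messages needed for a request to reach the owner (reply excluded).

    Simplified model; corner cases such as stale caches are ignored.
    """
    if owner == requester:
        return 0                      # local access via shared memory
    if cached_owner == owner:
        return 1                      # cache hit: straight to the owner
    hops = 0
    if home != requester:
        hops += 1                     # requester -> home node
    if owner != home:
        hops += 1                     # home node -> owner (forward)
    return hops


print(request_hops(requester=0, home=1, owner=2))                  # via home
print(request_hops(requester=0, home=1, owner=2, cached_owner=2))  # cached
print(request_hops(requester=0, home=1, owner=0))                  # local
```

In this model, a correct location cache entry saves the message to the home node, which matches the purpose of location caches described above.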
Algorithm 3.1: Distributed asynchronous SGD with Lapse. Differences to the
Classic PS implementation (Algorithm 2.3 on page 17) are highlighted in green.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      t: the ID of this worker thread,
      T: the number of total worker threads
 1  for epoch ← 1 to num_epochs do
 2      b = num_batches(D, batch_size, t, T)
        // data loading (pipeline parallel with training, in separate thread(s))
 3      B = []
 4      for i ← 1 to b do
 5          B_i = prepare_batch(i, D, batch_size, epoch, t, T)
 6          localize(keys(B_i))
        // training
 7      for i ← 1 to b do
 8          w = pull(keys(B_i))
 9          w = compute_update(B_i, w)
10          push(keys(B_i), w)
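The pipelining of data loading and training in Algorithm 3.1 can be sketched in Python as follows. This is an illustration only: `RecordingPs` is a hypothetical stub that keeps values in one process and logs calls, a batch is represented simply by the list of keys it accesses, and the update function is a dummy.

```python
import queue
import threading


class RecordingPs:
    """Stub PS that stores values locally and logs calls (hypothetical)."""

    def __init__(self):
        self.values = {}
        self.log = []
        self.lock = threading.Lock()

    def localize(self, keys):
        with self.lock:
            self.log.append(("localize", tuple(keys)))

    def pull(self, keys):
        with self.lock:
            self.log.append(("pull", tuple(keys)))
            return [self.values.get(k, 0.0) for k in keys]

    def push(self, keys, updates):
        with self.lock:
            self.log.append(("push", tuple(keys)))
            for k, u in zip(keys, updates):
                self.values[k] = self.values.get(k, 0.0) + u


def train_epoch(ps, batches, compute_update):
    prepared = queue.Queue(maxsize=2)   # small look-ahead buffer

    def load():                         # data loader thread
        for batch in batches:
            ps.localize(batch)          # trigger relocation early
            prepared.put(batch)
        prepared.put(None)              # end-of-epoch marker

    threading.Thread(target=load, daemon=True).start()
    while (batch := prepared.get()) is not None:
        w = ps.pull(batch)
        ps.push(batch, compute_update(batch, w))


ps = RecordingPs()
train_epoch(ps, [[1, 2], [2, 3]], lambda keys, w: [-0.1 for _ in keys])
print(ps.values)
```

Because the loader calls localize before it hands a batch to the training loop, the localize call for a batch always precedes the corresponding pull, which mirrors the ordering of lines 6 and 8 in the algorithm.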
Parameter Relocation
A localize call requires Lapse to relocate the parameter to the new owner and
update the location information on the home node. Care needs to be taken
that push and pull operations that are issued while the parameter is relocated
are handled correctly. Lapse ensures correctness by forwarding all operations
to the new owner immediately, possibly before the relocation is finished. The
new owner simply queues all operations until the relocation is finished. Lapse
sends at most three messages for a relocation of one parameter and pauses
processing for the relocated parameter only for the time that it takes to send
one network message. The entire protocol is described in Section 3.2.2.
Consistency
In general, Lapse provides the sequential consistency guarantees of classic PSs
even in the presence of relocations. We show in Section 3.2.4 that location
caches may impact consistency guarantees. In particular, when location
caches are used, Lapse still provides sequential consistency for synchronous
operations, but only eventual consistency for asynchronous operations.
[Figure 3.5 panels: each panel plots parameters (rows) over time in seconds (x-axis); colors indicate the parameter allocation.]
(a) Data clustering. Each parameter is relocated once, to the node that accesses it most frequently.
(b) Parameter blocking. Blocks of parameters are relocated at the beginning of each subepoch.
(c) Latency hiding. A parameter is relocated to a node whenever that node accesses the parameter.
Figure 3.5: Parameter allocation in Lapse for three example workloads with
different PAL techniques. Each row corresponds to one parameter, the x-axis
depicts time, and colors indicate the current allocation of a parameter.
3.2.2 Parameter Relocation
A key component of Lapse is the relocation of parameters. It is important that
this relocation is efficient because PAL techniques may relocate parameters
frequently (up to 36000 keys and 289 million parameter values per second in
our experiments). Figure 3.5 illustrates how Lapse relocates parameters for
example workloads with the data clustering, parameter blocking, and latency
hiding PAL techniques. In the following, we discuss how Lapse relocates
parameters, how it manages operations that are issued during a relocation,
and how it handles simultaneous relocation requests by multiple nodes.
During a localize operation, (1) the home node needs to be informed
of the location change, (2) the parameter needs to be moved from its current
owner to the new owner, and (3) Lapse needs to stop processing operations
at the current owner and start processing operations at the new owner. Key
decisions are what messages to send and how to handle operations that are
issued during parameter relocation. Lapse aims to keep both the relocation
time and the blocking time for a relocation short. We use relocation time
to refer to the time between issuing a localize call and the moment when
the new owner starts answering operations locally. By blocking time, we
mean the time in which Lapse cannot immediately process operations for
the parameter (but instead queues operations for later processing).¹ The
two measures usually differ because the current owner continues to process
operations for some time after the localize call is issued at the requesting
node, i.e., the relocation time may be larger than the blocking time.
We refer to the node that issued the localize operation as the requester
node. The requester node is the new owner of the parameter after the relocation
has finished. Lapse sends three messages in total to relocate a parameter,
see Figure 3.6. (1) The requester node informs the home node of the parameter
about the location change. The home node updates the location information
immediately and starts routing parameter accesses for the relocated parameter
to the requester node. (2) The home node instructs the old owner to
stop processing parameter accesses for the relocated parameter, to remove
the parameter from its local storage, and to transfer it to the requester node.
(3) The old owner hands over the parameter to the requester node. The requester
node inserts the parameter into its local storage and starts processing
parameter accesses for the relocated parameter.
During the relocation, the requester node queues all parameter accesses
that involve the relocated parameter. It queues both local accesses (i.e., accesses
¹ Both pull and push operations need to be queued during this period. However, as push
operations are typically run asynchronously (i.e., the application continues execution before
the push returns), the blocking time affects pull operations more directly.
[Figure 3.6: messages among the requester node (requests localization), the home node (manages location), and the owner node (holds parameter): (1) request relocation, (2) instruct relocation, (3) relocate.]
Figure 3.6: A worker requests to localize a parameter. Lapse transparently
relocates the parameter from the current owner to the requester node and
informs the home node of the location change.
by workers at the requester node) and remote accesses that are routed to it
before the relocation is finished. Once the relocation is completed, it processes
the queued operations in order and then starts handling further accesses as
the new owner. As discussed in Section 3.2.4, this approach ensures that
sequential consistency is maintained.
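The three messages and the queueing at the requester can be traced in a small single-process simulation. The sketch below is a simplified illustration and not the Lapse code: nodes are plain objects, the home node is reduced to a location dictionary, only push operations are modeled, and message delivery is simulated by stepping through a generator.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}      # parameters owned by this node
        self.queued = {}     # key -> operations waiting for a relocation

    def begin_relocation(self, key):
        self.queued[key] = []

    def push(self, key, update):
        if key in self.queued:                 # relocation in progress
            self.queued[key].append(update)
        else:
            self.store[key] += update

    def finish_relocation(self, key, value):
        self.store[key] = value
        for update in self.queued.pop(key):    # process queue in order
            self.store[key] += update


def localize(key, requester, old_owner, location):
    """Yield after each of the three messages of the sketch."""
    requester.begin_relocation(key)
    location[key] = requester                  # (1) requester -> home
    yield "home updated"
    value = old_owner.store.pop(key)           # (2) home -> old owner
    yield "old owner stopped"
    requester.finish_relocation(key, value)    # (3) old owner -> requester
    yield "relocated"


a, b = Node("a"), Node("b")
a.store["k"] = 1.0
location = {"k": a}                            # kept by the home node

steps = localize("k", requester=b, old_owner=a, location=location)
next(steps)                                    # message 1 delivered
location["k"].push("k", 0.5)                   # routed to b, queued there
next(steps)                                    # message 2 delivered
next(steps)                                    # message 3 delivered
print(b.store["k"])
```

The update that arrives during the relocation is not lost: it waits in the queue of the requester and is applied once the parameter value has arrived.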
In the absence of other operations, the relocation time for this protocol
is approximately the time for sending three messages over the network, and
the blocking time is the time for sending one message (because operations are
queued at the requester and the home node starts forwarding to the requester
immediately). One may try to reduce blocking time by letting the old owner
process operations until the relocation is complete (and forwarding all updates
to the new owner). However, such an approach would require additional
communication and would increase relocation time. The protocol used by
Lapse strikes a balance between short relocation and short blocking time.
If multiple nodes simultaneously localize the same parameter, there is a
localization conflict: without replicas, a parameter resides at only one node at
a time. In the case of a localization conflict, the above protocol transfers the
parameter to each requesting node once (in the order the relocation requests
arrive at the home node). This gives each node a short opportunity to process
the parameter locally, but also causes communication overhead for frequently
localized parameters (because it repeatedly transfers the parameter value, po-
tentially in cycles). A short localize moratorium, in which further localize
requests are ignored, may reduce this cost, but would change the semantics
of the localize primitive, increase complexity, and may impact overall effi-
ciency. Additionally, a centralized protocol, i.e., a protocol in which workers
request localization (rather than explicitly triggering relocations) and a cen-
tral instance decides where to allocate the parameter, could avoid frequent
relocations, e.g., by keeping the parameter at the node with the most requests
Chapter 3. Exploiting Locality: Dynamic Parameter Allocation
in some time period. We did not consider these approaches in Lapse. We do,
however, pick up these considerations in the following chapters of this thesis:
we (i) explore how localization conflicts can be avoided by replicating (some
of) the parameters rather than relocating them (at the cost of introducing
staleness and, thus, falling back to weaker consistency guarantees) (Chapter 4)
and (ii) develop an approach in which workers indeed request parameters
(we call these requests intent signals) and the requests for one parameter are
coordinated centrally to decide where this parameter should be allocated and
replicated (Chapter 5).
3.2.3 Parameter Access
When effective PAL techniques are used, the majority of parameter accesses
are processed locally. Nevertheless, remote access to all parameters may arise
at all times and needs to be handled appropriately. We now discuss how Lapse
handles local access, remote access, location caches, and access to a parameter
that is currently relocating.
Local Access
Lapse provides fast local parameter access by accessing locally stored parameters
via shared memory directly from the worker threads, i.e., without involving
the PS thread (see Figure 3.4) or other nodes. In our experiments, accessing
the parameter storage via shared memory provided up to 6x lower latency
than access via a PS thread using queues (as implemented in Petuum (Xing
et al. 2015), for example). Like other PSs, Lapse guarantees per-key atomic
reads and writes;² it does so using latches (i.e., locks held for the time of the
operation) for local accesses (see Section 3.2.7).
Remote Access
We now discuss remote parameter access and first assume that there is no
location caching. There are two basic strategies. In the location request
strategy, the worker retrieves the current owner of the parameter from the
home node and subsequently sends the pull or push request to that owner
(Figure 3.7a). In the forward strategy, the worker sends the request itself to the
home node, which then forwards it to the current owner (Figure 3.7b). Lapse
employs the forward strategy because (i) it always uses up-to-date location
information for routing decisions and (ii) it requires one message less than
location request. The forward strategy uses the latest location information for
routing because the home node, which holds the location information, sends
2 Such per-key atomic access is common even in asynchronous SGD (Niu et al. 2011).
[Figure 3.7 (diagrams). Nodes in each panel: requester, old owner, owner, home.
(a) Location request: 1 ask location, 2 reply location, 3 request value, 4 send value.
(b) Forward: 1 request value, 2 request value (forward), 3 send value.
(c) Correct location cache: 1 request value, 2 send value.
(d) Stale cache, double-forward: 1 request value, 2 request value (forward), 3 request value (forward), 4 send value.]
Figure 3.7: Routing for non-local parameter access. If the location of a param-
eter is unknown, Lapse employs the forward strategy (Figure b), requiring 3
messages. Lapse optionally supports location caches. A correct cache reduces
the number of messages to 2 (Figure c), a stale cache increases it to 4 (Figure d).
The figures depict labels for pull messages. The labels for push are analogous:
request value → send update, send value → confirm update, request value (forward) →
send update (forward).
the request to the owner (message 2). In contrast, in location request, the
requester node sends the request (message 3) based on the location obtained
from the home node. This location may be outdated if another worker
requests a relocation after the home node replied (message 2). In this case,
the requester node may send message 3 to an outdated owner; such a case
would require special handling.
Location Caching
Lapse provides the option to cache the locations of recently accessed parame-
ters. This allows workers to contact the current owner directly (Figure 3.7c),
reducing the number of necessary messages to two. To avoid managing cached
locations and sending invalidation messages, the location caches are updated
only after push and pull operations and after parameter relocations (i.e.,
without any additional messages). As a consequence, the cache may hold
stale entries. If such an entry is used, Lapse uses a double-forward approach,
which increases the number of sent messages by one (Figure 3.7d).
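
The message counts of the routing variants can be made concrete with a short sketch. It is a simplified model rather than Lapse's routing code: the functions `location_request` and `forward`, the node names, and the dictionaries `owner_of`, `home_of`, and `cache` are hypothetical. The sketch assumes that requester, home node, and owner are distinct nodes and that a node that receives a request for a parameter it no longer holds forwards that request via the home node.

```python
def location_request(requester, key, owner_of, home_of):
    """Location request strategy: ask the home node first, then contact the owner."""
    home, owner = home_of[key], owner_of[key]
    return [
        (requester, home, "ask location"),
        (home, requester, "reply location"),
        (requester, owner, "request value"),
        (owner, requester, "send value"),
    ]


def forward(requester, key, owner_of, home_of, cache=None):
    """Forward strategy for one pull, optionally with a location cache."""
    home, owner = home_of[key], owner_of[key]
    cached = cache.get(key) if cache is not None else None
    if cached is None:                       # location unknown: route via home node
        msgs = [(requester, home, "request value"),
                (home, owner, "request value (forward)")]
    elif cached == owner:                    # correct cache entry: contact owner directly
        msgs = [(requester, owner, "request value")]
    else:                                    # stale cache entry: double-forward
        msgs = [(requester, cached, "request value"),
                (cached, home, "request value (forward)"),
                (home, owner, "request value (forward)")]
    msgs.append((owner, requester, "send value"))
    if cache is not None:
        cache[key] = owner                   # cache refreshed by the reply, no extra message
    return msgs


home_of = {"w": "n1"}
owner_of = {"w": "n2"}
cache = {}

uncached = forward("n0", "w", owner_of, home_of, cache)      # location unknown
cached_ok = forward("n0", "w", owner_of, home_of, cache)     # cache now points to n2
owner_of["w"] = "n3"                                         # parameter relocates
cached_stale = forward("n0", "w", owner_of, home_of, cache)  # cache still points to n2
counts = [len(uncached), len(cached_ok), len(cached_stale)]
baseline = len(location_request("n0", "w", owner_of, home_of))
```

Under these assumptions, `counts` reproduces the 3, 2, and 4 messages stated in the caption of Figure 3.7, and the location request strategy needs 4 messages.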
Access During Relocation
Workers can issue operations for any parameter at any time, including when
a parameter is relocating. In the following, we discuss how Lapse handles
different possible scenarios of operations on a relocating parameter. First, sup-
pose that the requester node (to which the parameter is currently relocating)
accesses the parameter. Lapse then locally queues the request at the requester
node and processes it when the relocation is finished. Second, suppose that
the old owner accesses the parameter. Lapse processes the parameter access
locally if it occurs before the parameter leaves the local store. Otherwise,
Lapse sends the operation to the new owner and processes it there. Finally,
consider that a third node (neither the requester nor the old owner) accesses
the parameter. If location caches are disabled, there are two cases. (1) The
access arrives at the home node before the relocation. Then Lapse forwards
the access to the old owner and processes it there (before the relocation).
(2) The access arrives at the home node after the relocation. Then Lapse
forwards and processes it at the new owner. If necessary, the new owner
queues the access until the relocation is finished. With location caches, Lapse
additionally processes the request at the old owner if the parameter’s location
is cached correctly at the requester node and the access arrives at the old
owner before the relocation.
Table 3.2: Per-key consistency guarantees of PS architectures, using representatives
for types: PS-Lite (Mu Li et al. 2014a) for classic and Petuum (Xing et al.
2015) for replication.

Parameter Server                     Classic          Lapse                     Replication
Synchronous                          sync    async    sync    async   async     sync, async
Location caches                                               off     on
Eventual                             ✓       ✓        ✓       ✓       ✓         ✓
PRAMᵃ (Lipton and Sandberg 1988)     ✓       ✓ᵇ       ✓       ✓ᵇ      ×         ✓
Causal (Hutto and Ahamad 1990)       ✓       ✓ᵇ       ✓       ✓ᵇ      ×         ×
Sequential (Lamport 1979)            ✓       ✓ᵇ       ✓       ✓ᵇ      ×         ×
Serializability                      ×       ×        ×       ×       ×         ×

ᵃ I.e., monotonic reads, monotonic writes, and read your writes
ᵇ Assuming that the network layer preserves message order (which is true for Lapse
and PS-Lite)
3.2.4 Consistency
In this section, we analyze the consistency properties of Lapse and compare
them to classic PSs, i.e., to PS-Lite (Mu Li et al. 2014a). Table 3.2 provides a
summary. Consistency guarantees affect the convergence of ML algorithms
in the distributed setting; in particular, relaxed consistency can slow down
convergence (Ho et al. 2013; Dai et al. 2015). The extent of this impact dif-
fers from task to task (Ho et al. 2013). None of the existing PSs guarantee
serializability, as pull and push operations of different workers can overlap
arbitrarily; neither do PSs give consistency guarantees across multiple keys.
PSs can, however, provide per-key sequential consistency. Sequential consis-
tency provides two properties (Lamport 1979): (1) each worker’s operations
are executed in the order specified by the worker, and (2) the result of any
execution is equivalent to an execution of the operations of all workers in
some sequential order. In the following, we study per-key sequential con-
sistency for synchronous and asynchronous operations. Replication PSs do
not provide sequential consistency. They provide weaker forms, specifically
PRAM consistency and eventual consistency (Ho et al. 2013). We assume in
the following that nodes process messages in the order they arrive (which is
true for PS-Lite and Lapse).
Synchronous Operations
A classic PS guarantees sequential consistency: it provides property (1) be-
cause workers block during synchronous operations, preventing reordering,
and (2) because all operations on one parameter are performed sequentially
by its owner.
Theorem 1. Lapse guarantees sequential consistency for synchronous operations.
Proof. In the absence of relocations, Lapse provides sequential consistency,
analogously to classic PSs: it provides property (1) because workers block
during synchronous operations and property (2) because all operations on
one parameter are performed sequentially by the owner of this parameter.
In the presence of relocations, it provides property (1) because synchronous
operations also block the worker if a parameter relocates. It provides property
(2) because, at each point in time, only one node processes operations for one
parameter. During a relocation, the old owner processes operations until
the parameter leaves the local store (at this time, no further operations for
this key are in the old owner’s queue). Then the parameter is transferred
to the new owner, which then starts processing. The new owner queues
concurrent operations until the relocation is finished and then processes
them in sequence. Lapse enforces sequential execution among local threads
with latches and blocking operations.
Asynchronous Operations
A classic PS such as PS-Lite provides sequential consistency for asynchronous
operations.³ Property (1) requires that operations reach the responsible server
in program order (as the worker does not block during an asynchronous
operation). This is the case in PS-Lite as it sends each message directly to the
responsible server. Property (2) is given as for synchronous operations.
Theorem 2. Lapse without location caches guarantees sequential consistency for
asynchronous operations.
Proof. For property (1), first suppose that there is no concurrent relocation.
Lapse routes the operations of a worker on one parameter to the parameter’s
home node and from there to the owner. Message order is preserved in both
steps under our assumptions. Now suppose that the parameter is relocated
in-between operations. In this case, the old owner processes all operations
that arrive at the home node before the relocation. Then the parameter is
moved to the new owner, which then takes over and processes all operations
that arrive at the home node after the relocation. Again, message order is
preserved in all steps, such that Lapse provides property (1). It provides
property (2) by the same argument as for Theorem 1.
Theorem 3. Lapse with location caches does not provide sequential consistency
for asynchronous operations.
3 We assume that the network layer preserves message order. This is the case in PS-Lite and
Lapse because they use TCP and send operations of a thread over the same connection.
Table 3.3: Location management strategies. N is the number of nodes, K is
the number of parameter keys. In practice, if one operation accesses multiple
keys, the total number of messages potentially scales sub-linearly because Lapse
groups messages when possible, see Section 3.2.7.

Strategy                 Storage       Number of messages for
                         (per node)    remote access    relocation
Static partition         0             2                n/a
Broadcast operations     0             N                0
Broadcast relocations    K             2                N
Home node                K/N           3ᵃ               3

ᵃ 3 messages if uncached, 2 with correct cache, 4 with stale cache
Proof. Lapse does not provide property (1) because a location cache change
can cause two operations to be routed differently, which can change message
order at the recipient. For example, consider two operations O1 and O2 of
one worker. The worker first issues operation O1, then operation O2. It
is possible that operation O1 is sent to the currently cached, but outdated,
owner. Then the location cache is updated (by another returning operation)
and operation O2 is sent directly to the current owner. With this, it is possible
that operation O2 is processed before operation O1, because operation O1
has to be double-forwarded to the current owner. This breaks sequential,
causal, and PRAM consistency.
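
The reordering in this proof can be traced with a few lines of arithmetic. The sketch below is an illustration only: it assumes a constant latency per network hop and hypothetical send times, neither of which is taken from the thesis. The function name `arrival_time` is likewise hypothetical.

```python
def arrival_time(send_time, hops, hop_latency=1.0):
    """Time at which a message reaches the current owner (constant hop latency assumed)."""
    return send_time + hops * hop_latency


# With location caches: O1 is issued first but uses a stale cache entry, so it travels
# worker -> old owner -> home node -> current owner (3 hops). O2 is issued later with
# an updated cache entry and travels worker -> current owner (1 hop).
with_cache = {"O1": arrival_time(0.0, hops=3), "O2": arrival_time(0.5, hops=1)}
order_with_cache = sorted(with_cache, key=with_cache.get)

# Without location caches: both operations travel worker -> home node -> current owner
# (2 hops), so the order in which they were issued is preserved.
without_cache = {"O1": arrival_time(0.0, hops=2), "O2": arrival_time(0.5, hops=2)}
order_without_cache = sorted(without_cache, key=without_cache.get)
```

With these assumed numbers, O2 reaches the current owner at time 1.5 and O1 at time 3.0, so the owner processes O2 first even though the worker issued O1 first.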
3.2.5 Location Management
There are several strategies for managing location information in a PS with
DPA. Key questions are how to store and communicate knowledge about
which server is currently responsible for a parameter. Table 3.3 contrasts
several possible strategies. For reference, we include the static partitioning of
existing PSs (which does not support DPA). In the following, we discuss the
different strategies. We refer to the number of nodes as N and to the number
of keys as K.
Broadcast Operations
One strategy is to avoid storing any location information and instead broad-
cast the request to all nodes for each non-local parameter access. Then, only
the server that currently holds the parameter responds to the request (all other
servers ignore the message). This requires no storage but sends N messages
per parameter access (N − 1 messages to all other nodes, one reply back to
the requester). This high communication cost is not acceptable within a PS.
Broadcast Relocations
An alternative strategy is to replicate location information to all nodes. This
requires storing K locations on each node (one for each of the K keys). An
advantage of this approach is that only two messages are required per remote
parameter access (one request to the current owner of the parameter and
the response). However, storage cost may be high when there is a large
number of parameters and each location change has to be propagated to all
nodes. The simplest way to do this is via direct mail, i.e., by sending N − 2
additional messages to inform all nodes that were not involved in a parameter
relocation. Gossip protocols (Demers et al. 1987) could be used to reduce
this communication overhead.
Home Node
Lapse uses a home node strategy, inspired by distributed hash tables (Rat-
nasamy et al. 2001; Stoica et al. 2001; Rowstron and Druschel 2001; B. Zhao
et al. 2004) and home-based approaches in general (Steen and Tanenbaum
2017). The home node of a parameter knows which node currently holds
the parameter. Thus, if any node does not know the location of a parameter,
it sends a request to the home node of that parameter. As discussed in Sec-
tion 3.2.3, this requires at least one additional message for remote parameter
access. A home node is assigned to each parameter using static partitioning,
e.g., using range or hash partitioning. A simpler, but not scalable variant of
this strategy is to have one centralized home node that knows the locations
of all parameters. We discard this strategy because it limits the number of
model parameters to the size of one node and creates a scalability bottleneck
at the central home node.
Lapse employs the (decentralized) home node strategy because it requires
little storage overhead and sends few messages for remote parameter access,
especially when paired with location caches.
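
The trade-offs among these strategies can be tabulated for concrete cluster sizes, and the static assignment of home nodes can be sketched in a few lines. The storage and message counts below restate Table 3.3; the function names, the integer keys, and the multiplicative hash constant are assumptions made for this illustration and are not taken from Lapse.

```python
def strategy_costs(num_nodes, num_keys):
    """Per-node storage (in locations) and message counts, following Table 3.3."""
    n, k = num_nodes, num_keys
    return {
        "static partition":      {"storage": 0,     "remote_access": 2, "relocation": None},
        "broadcast operations":  {"storage": 0,     "remote_access": n, "relocation": 0},
        "broadcast relocations": {"storage": k,     "remote_access": 2, "relocation": n},
        "home node":             {"storage": k / n, "remote_access": 3, "relocation": 3},
    }


def home_node_range(key, num_keys, num_nodes):
    """Range partitioning: contiguous key ranges map to the same home node."""
    return key * num_nodes // num_keys


def home_node_hash(key, num_nodes):
    """Hash partitioning with a simple multiplicative hash (illustrative choice)."""
    return (key * 2654435761) % 2**32 % num_nodes


costs = strategy_costs(num_nodes=8, num_keys=1_000_000)
range_homes = [home_node_range(key, 100, 4) for key in (0, 24, 25, 99)]
hash_homes = [home_node_hash(key, 8) for key in range(1000)]
```

For 8 nodes and one million keys, the home node strategy stores 125,000 locations per node, compared with one million per node when relocations are broadcast, while a remote access costs 3 messages instead of the 8 needed when operations are broadcast.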
3.2.6 Granularity of Location Management
Location can be managed at different granularities, e.g., for each key or
for ranges of keys. Lapse manages parameter location per key and allows
applications to localize multiple parameters in a single localize operation.
This provides high flexibility, but can cause overhead if applications do
not require fine-grained location control. For example, parameter blocking
algorithms (Gemulla et al. 2011; Teflioudi et al. 2012; Yun et al. 2014) relocate
parameters exclusively in static (pre-defined) blocks. For such algorithms,
a possible optimization would be to manage location on group level. This
would reduce storage requirements and would allow the system to optimize
for communication of these groups. We do not consider such optimizations
because Lapse aims to support many PAL methods, including ones that
require fine-grained location control, such as latency hiding.
3.2.7 Important Implementation Aspects
In this section, we discuss implementation aspects that are key for the perfor-
mance or the consistency of Lapse.
Message Grouping
If a single push, pull, or localize operation includes more than one parameter,
Lapse groups messages that go to the same node to reduce network overhead.
For example, consider that one localize call relocates multiple parameters. If
two of the parameters are managed by the same home node, Lapse sends only
one message from requester to this home node. If the two parameters then
also currently reside at the same location, Lapse again sends only one message
from home node to the current owner and one back from the current owner
to the requester. Message grouping adds system complexity, but is highly
beneficial when clients access or localize sets of parameters at once.
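
The grouping step can be sketched as follows. This is a minimal illustration, not Lapse's message layer: the function `group_by_destination` and the modulo-based home node assignment in the usage lines are hypothetical.

```python
from collections import defaultdict


def group_by_destination(keys, destination_of):
    """Group the keys of one operation so that each destination gets a single message."""
    groups = defaultdict(list)
    for key in keys:
        groups[destination_of(key)].append(key)
    return dict(groups)


# One localize call for four keys; home nodes are assigned by key modulo 4 here.
requested_keys = [3, 10, 11, 42]
per_home_node = group_by_destination(requested_keys, lambda key: key % 4)
num_messages = len(per_home_node)
```

In this example, keys 3 and 11 share home node 3 and keys 10 and 42 share home node 2, so the requester sends two messages instead of four. The same grouping could be applied again at each home node, keyed by the current owner.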
Local Parameter Store
Like other PSs (Ho et al. 2013; Mu Li et al. 2014a), Lapse provides two variants
for the local parameter store: dense arrays and sparse maps. Dense parameter
storage is suitable if parameter keys are contiguous; sparse storage is suitable
when they are not. Lapse uses a list of L latches to synchronize parameter
access, while allowing parallel access to different parameters. A parameter
with key k is protected by latch k mod L. Applications can customize L. A
default value of L = 1000 latches worked well in our experiments.
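
The latch scheme can be sketched with standard locks. The following is an illustration in Python rather than Lapse's C++ implementation: the class `LatchedStore` and its methods are hypothetical, the store is a dense list of vectors, and push is assumed to add an update to the stored vector.

```python
import threading


class LatchedStore:
    """Dense parameter store in which key k is protected by latch k mod L."""

    def __init__(self, num_keys, value_length, num_latches=1000):
        self.values = [[0.0] * value_length for _ in range(num_keys)]
        self.latches = [threading.Lock() for _ in range(num_latches)]

    def latch_for(self, key):
        return self.latches[key % len(self.latches)]

    def pull(self, key):
        with self.latch_for(key):               # latch held only for this read
            return list(self.values[key])

    def push(self, key, update):
        with self.latch_for(key):               # latch held only for this write
            vector = self.values[key]
            for i, delta in enumerate(update):
                vector[i] += delta              # push assumed to be additive


store = LatchedStore(num_keys=2000, value_length=2)


def worker():
    for _ in range(1000):
        store.push(7, [1.0, 1.0])


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_value = store.pull(7)
```

Because every read and write of key 7 happens under the same latch, no update is lost. Keys 7 and 1007 share one latch when L = 1000, whereas keys 7 and 8 can be accessed in parallel.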
No Message Prioritization
To reduce blocking time, Lapse could have opted to prioritize the processing
of messages that belong to parameter relocations. However, this prioritiza-
tion would break most consistency guarantees for asynchronous operations
(i.e., sequential, causal, and PRAM consistency). The reason for this is that
an “instruct relocation” message could overtake a parameter access message
at the old owner of a relocation. The old owner would then reroute the
parameter access message, such that it potentially arrives at the new owner
after parameter access messages that were issued later. Therefore, Lapse does
not prioritize messages.
3.3 Experiments
We conducted an experimental study to investigate the efficiency of classic PSs
(Section 3.3.2) and whether it can be beneficial to integrate PAL techniques
into PSs (Section 3.3.3). Further, we investigated how efficient Lapse is in
comparison to a task-specific low-level implementation (Section 3.3.4) and
replication PSs (Section 3.3.5), and conducted an ablation study (Section 3.3.6).
Our source code, datasets, and information on reproducing our experiments
are available online.4
Our major insights are: (i) Classic PSs suffered from severe communica-
tion overhead compared to a single node: using Classic PSs, 2–8 nodes were
slower than 1 node in all tested tasks. (ii) Integrating PAL techniques into
the PS reduced this communication overhead: Lapse was 4–203x faster than a
classic PS, with 8 nodes outperforming 1 node by up to 9x. (iii) Lapse scaled
better than a state-of-the-art replication PS (8 nodes were 9x vs. 2.9x faster
than 1 node).
3.3.1 Experimental Setup
Tasks
We considered three popular ML tasks that require long training: matrix
factorization, knowledge graph embeddings, and word embeddings. Table 3.4
summarizes details about the models and the datasets that we used for these
tasks. We employ varied PAL techniques for the tasks. In the following, we
briefly discuss each task.
Matrix factorization
Low-rank matrix factorization is a common tool for
analyzing and modeling dyadic data, e.g., in collaborative filtering for
recommender systems (Koren et al. 2009). We employed a parameter
blocking approach (Gemulla et al. 2011) to create and exploit PAL:
communication happens only between subepochs; within a subepoch,
all parameter access is local. We implemented this algorithm in PS-
Lite (a classic PS), Petuum (a replication PS), and Lapse. Further, we
compared to a task-specific and tuned low-level implementation of
this parameter blocking approach.⁵ We used two synthetic datasets
from (Makari et al. 2015), because the largest openly available dataset
that we are aware of is only 7.6GB large.⁶ For both datasets, we ran
4https://github.com/alexrenz/lapse-ps/tree/vldb20/
5https://github.com/uma-pi1/DSGDpp
6 We adopt these datasets from (Makari et al. 2015). Note that the revealed cells in both
datasets are distributed uniformly across rows and columns, whereas real-world datasets often
have skewed distributions (Meka et al. 2009). For experiments with skewed MF datasets, see
Table 3.4: ML tasks, models, and datasets. The rightmost columns depict the
number of key accesses and the size of read parameters (per second, for a single
thread), respectively.

Task                         Model parameters                                     Data                                      Param. access
                             Model                      Keys    Values  Size      Dataset           Data points  Size      Keys/s  MB/s
Matrix factorization         Latent factors, rank 100   6.4M    640M    4.8GB     3.4m×3m matrix    1000M        31GB      414k    315
                             Latent factors, rank 100   11.0M   1100M   8.2GB     10m×1m matrix     1000M        31GB      316k    241
Knowledge graph embeddings   ComplEx, dim. 100          0.5M    98M     0.7GB     DBpedia-500k      3M           47MB      312k    476
                             ComplEx, dim. 4000         0.5M    3929M   29.3GB    DBpedia-500k      3M           47MB      11k     643
                             RESCAL, dim. 100           0.5M    110M    0.8GB     DBpedia-500k      3M           47MB      12k     614
Word embeddings              Word2Vec, dim. 1000        1.1M    1102M   4.1GB     1b word benchm.   375M         3GB       17k     65
a factorization of rank 100. In all PSs, we ran a global barrier after
each subepoch to ensure consistency. In Petuum, to ensure consistent
replicas, we issued one clock after each subepoch and set a staleness
threshold of 1. Petuum’s own matrix factorization implementation
ran out of memory because it stores dense matrices.
Knowledge graph embeddings
Knowledge graph embedding (KGE) mod-
els learn algebraic representations of the entities and relations in a
knowledge graph. For example, these representations have been ap-
plied successfully to infer missing links in knowledge graphs (Nickel et
al. 2016a). A vast number of KGE models has been proposed (Nickel et
al. 2011; Bordes et al. 2013; Nickel et al. 2016b; Bishan Yang et al. 2015;
H. Liu et al. 2017), with different training techniques (Ruffinelli et al.
2020). We studied two models as representatives: RESCAL (Nickel
et al. 2011) and ComplEx (Trouillon et al. 2016). We employed data
clustering and latency hiding to create and exploit PAL. We used the
DBpedia-500k dataset (Shi and Weninger 2018), a real-world knowledge
graph that contains 490598 entities and 573 relations of DBpedia (Auer
et al. 2007). We ran the common setting of SGD with AdaGrad (Duchi
et al. 2011) and negative sampling (Ruffinelli et al. 2020; H. Liu et al.
2017). We stored the AdaGrad metadata in the PS. In all experiments,
we generated negative samples by perturbing both subject and object
of positive triples 10 times. We set the initial learning rate for AdaGrad
to 0.1. We used data clustering to create and exploit PAL for
relation parameters, and latency hiding for entity parameters. For the
relation parameters, we partitioned the training dataset by relation and
allocated each relation parameter at the node that uses it, such that
(Footnote 6, continued) our studies in Sections 4.4 and 5.4.
all accesses to relation parameters are local. Regarding entity param-
eters, each worker pre-localizes all parameters that it requires for the
batch that follows the current one (where one batch consists of one
subject—relation—object training example and 10 subject-perturbed
and 10 object-perturbed negative samples). The transfer of these param-
eters then overlaps with the computation for the current batch. We
tried looking further into the future, e.g., localizing the parameters of
a batch 2, 3, 10, or 100 batches into the future. We observed similar
speed-ups for 2 and 3 and lower speed-ups for 10 and 100.
Word embeddings
Word embeddings are a language modeling technique in
natural language processing: each word of a vocabulary is mapped to
a vector of real numbers (Mikolov et al. 2013; Pennington et al. 2014;
Peters et al. 2018). These vectors are useful as input for many natural
language processing tasks, for example, syntactic parsing (Socher et al.
2013) or question answering (X. Liu et al. 2018). In our experimental
study, we used the skip-gram Word2Vec (Mikolov et al. 2013) model
and employed latency hiding to create and exploit PAL. We used the
One Billion Word Benchmark (Chelba et al. 2013) dataset, with stop
words of the Gensim (Řehůřek and Sojka 2010) stop word list removed.
We used common model parameters (Mikolov et al. 2013) of embedding
size 1000, window size 5, minimum count 2, negative sampling with 25
samples, and 1e-5 frequent word subsampling. We used a latency hiding
approach that pre-localizes parameters for the words of a sentence when
it reads the sentence. As negative samples, our approach chooses only
parameters that are (currently) available locally. This pool of locally
available parameters changes constantly as parameters are relocated
when they occur in sentences. This approach changes the local sampling
distribution of negative examples at one node. However, it mostly
preserves the global sampling distribution, as each parameter is local at
exactly one node (except that frequent parameters are sampled under-
proportionately, because they relocate more often and are sampled
nowhere during a transfer). To implement this scheme, Lapse exposes
an additional PullIfLocal primitive. We will discuss such sampling
access, available schemes to reduce communication overhead (such as
the local sampling scheme used here), and a principled approach to
sampling support in PSs in detail in Section 4.3.
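
The combination of pre-localization and local negative sampling described for word embeddings can be sketched with a toy, single-process stand-in for a PS. Everything in the sketch is an assumption made for illustration: the class `ToyPS`, the method names `localize` and `pull_if_local`, the convention that `pull_if_local` returns `None` for non-local keys, and the uniform choice of candidate words (Word2Vec normally samples from a smoothed unigram distribution). Relocation is instantaneous here, whereas it is asynchronous in a real PS.

```python
import random


class ToyPS:
    """One node's view of the parameters; relocation is instantaneous in this toy."""

    def __init__(self, local_keys, values):
        self.local = set(local_keys)
        self.values = values

    def localize(self, keys):
        self.local.update(keys)                  # parameters become local to this node

    def pull_if_local(self, key):
        """Return the value if the key is currently local, otherwise None."""
        return self.values[key] if key in self.local else None


def sample_local_negatives(ps, vocabulary_size, num_samples, rng, max_tries=10_000):
    """Draw candidate words and keep only those whose parameters are local."""
    negatives = []
    tries = 0
    while len(negatives) < num_samples and tries < max_tries:
        tries += 1
        candidate = rng.randrange(vocabulary_size)
        if ps.pull_if_local(candidate) is not None:
            negatives.append(candidate)
    return negatives


vocabulary_size = 100
values = {word: [0.0, 0.0] for word in range(vocabulary_size)}
ps = ToyPS(local_keys=range(10), values=values)

sentence = [42, 57, 3]
ps.localize(sentence)                            # pre-localize when the sentence is read
negatives = sample_local_negatives(ps, vocabulary_size, 25, random.Random(0))
```

All sampled negatives are served from the local store, so sampling causes no network communication in this sketch. The pool of eligible negatives changes whenever `localize` is called, which corresponds to the shift in the local sampling distribution noted above.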
Implementation and Cluster
We implemented Lapse in C++, using ZeroMQ and Protocol Buffers for
communication, drawing from PS-Lite (Mu Li et al. 2014a). We ran version 1.1
of Petuum⁷ (Xing et al. 2015) and the version of Sep 1, 2019 of PS-Lite (Mu Li
et al. 2014a). We used a local cluster of 8 Dell PowerEdge R720 computers,
running CentOS Linux 7.6.1810, connected with 10 GBit Ethernet. Each
node was equipped with two Intel Xeon E5-2640 v2 8-core CPUs, 128 GB of
main memory, and four 2 TB NL-SAS 7200 RPM hard disks. We compiled
all code with gcc 4.8.5.
Settings and Measures
In all experiments, we used 1 server and 4 worker threads per node⁸ and stored
all model parameters in the PS, using dense storage. Each key held a vector
of parameter values. We report Lapse run times without location caches,
because they had minimal effect in Lapse. The reason for this is that Lapse
localizes parameters and location caches are not beneficial for local parameters
(see Section 3.3.6 for details). For all tasks but word embeddings, we measured
epoch run time, because the different variants that we run are identical (or
near-identical) with respect to convergence, such that the only difference
among these variants is in epoch run time. This allowed us to conduct
experiments in more reasonable time. For word embeddings, epochs are not
identical because the chosen latency hiding approach changes the sampling
distribution of negative samples. Thus, we measure model accuracy over time.
We calculated model accuracy using a common analogical reasoning task of
19544 semantic and syntactic questions (Mikolov et al. 2013). We conducted
3 independent runs of each experiment and report the mean. Error bars
depict the minimum and maximum. In some experiments, error bars are not
clearly visible because of small variance. Gray dotted lines indicate run times
of linear scaling.
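
The reported statistics and the linear scaling reference can be written down in a few lines. The run times in this sketch are made-up placeholder values, not measurements from the thesis, and the function names are hypothetical.

```python
def summarize(run_times):
    """Mean of the runs, plus minimum and maximum for the error bars."""
    return {
        "mean": sum(run_times) / len(run_times),
        "min": min(run_times),
        "max": max(run_times),
    }


def linear_scaling_time(single_node_time, num_nodes):
    """Run time that n nodes would need if scaling were perfectly linear."""
    return single_node_time / num_nodes


def speedup(single_node_time, multi_node_time):
    """How many times faster the multi-node run is than the single-node run."""
    return single_node_time / multi_node_time


single_node = summarize([80.0, 82.0, 78.0])      # placeholder epoch run times in minutes
eight_nodes = summarize([12.0, 13.0, 11.0])      # placeholder epoch run times in minutes
ideal = linear_scaling_time(single_node["mean"], 8)
observed_speedup = speedup(single_node["mean"], eight_nodes["mean"])
```

A mean run time above `ideal` indicates sub-linear scaling. A speedup below 1 means that the distributed run was slower than the single node.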
3.3.2 Performance of Classic Parameter Servers
We investigated the performance of classic PSs and how it compares to the
performance of efficient single node implementations. To this end, we mea-
sured the performance of a classic PS on 1–8 nodes for matrix factorization
(Figure 3.8), knowledge graph embeddings (Figure 3.9), and word embed-
dings (Figure 3.10). Besides PS-Lite, we ran Lapse as a classic PS (with shared
7 In consultation with the Petuum authors, we fixed an issue in Petuum that prevented
Petuum from running large models on a single node.
8For higher degrees of parallelism, see the experiments in Sections 4.4 and 5.4.
[Figure 3.8 (plots). Series: Classic PS (PS-Lite); Classic PS with fast local access (in Lapse); Lapse. x-axis: parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4). y-axis: epoch run time in minutes (ticks 1, 10, 100, 1000). Panels: (a) 10m×1m matrix, 1b entries; (b) 3.4m×3m matrix, 1b entries.]
Figure 3.8: Performance for matrix factorization. Lapse scaled linearly because
it exploits PAL. Classic PS approaches displayed significant communication
overhead over the single node. The classic PS approach in Lapse drops in per-
formance because it is efficient on a single node, see Section 3.3.2. The gray
dotted lines indicate linear scaling. Error bars depict minimum and maximum
run time (hardly visible here because of low variance).
memory access to local parameters). To do so, we disabled DPA, such that
parameters are allocated statically. We used random keys for the parameters
in both implementations.⁹ We omitted PS-Lite from the word embeddings
task due to its long run time.
Multi-Node Performance
The performance of classic PSs was dominated by communication overhead:
in none of the tested ML tasks did 2–8 nodes outperform a single node.
Instead, 2–8 nodes were 22–47x slower than 1 node for matrix factorization,
1.4–30x slower for knowledge graph embeddings, and 11x slower
for word embeddings. The two classic PS implementations displayed similar
performance on multiple nodes. With smaller numbers of nodes (e.g., on 2
nodes), the variant with fast local access can access a larger part of parameters
with low latency, and thus has a performance benefit. Further performance
differences stem from differences in the system implementations.
9 The performance of classic PSs depends on the (static) assignment of parameters. Both
implementations range partition parameters, which can be suboptimal if algorithms assign
keys to parameters non-randomly. Manually assigning random keys improved performance
for most tasks (and never deteriorated performance).
[Figure 3.9 (plots). Series: Classic PS (PS-Lite); Classic PS with fast local access (in Lapse); Lapse, only data clustering; Lapse. x-axis: parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4). y-axis: epoch run time in minutes. Panels: (a) ComplEx-Small (dim. 100/100); (b) ComplEx-Large (dim. 4000/4000); (c) RESCAL-Large (dim. 100/10000).]
Figure 3.9: Performance for training knowledge graph embeddings. Dis-
tributed training using the classic PS approach did not outperform a single
node in any task. Lapse scaled well for the large tasks (b and c), but not for the
small task (a).
Single-Node Performance
On a single node, the run times of PS-Lite and Lapse differed significantly (e.g.,
see Figure 3.8), because they access local parameters (i.e., all parameters when
using 1 node) differently. Lapse accesses local parameters via shared memory.
This was 71–91x faster than PS-Lite, which accesses local parameters via
inter-process communication.¹⁰ The Classic PS with fast local access displayed the
same single-node performance as Lapse, as all parameters are local if only one
node is used (even without relocations). This single-node efficiency is the
reason for its performance drop from 1 to 2 nodes. Comparing distributed run
times only against inefficient single node implementations can be misleading.
Communication Overhead
The extent of communication overhead depended on the communication-
to-computation ratios of the different tasks. The two rightmost columns
of Table 3.4 give an indication of this ratio. They depict the number of key
accesses and size of read parameter data per second, respectively, measured
for a single thread on a single node for the respective task. For example,
ComplEx-Small (Figure 3.9a) accessed the PS frequently (312k accessed keys
per second) and displayed high communication overhead (8 nodes were 14x
¹⁰ PS-Lite provides an option to explicitly speed up single node performance by using
memory copy inter-process communication. In our experiments, this was still 47–61x slower
than shared memory.
Table 3.5: Parameter reads, relocations, and relocation times in ComplEx-Large.
In this task, each key holds a vector of 8000 doubles. All parallelism
levels read 196 million keys in one epoch. On 2 nodes, mean RT is short as
every relocation involves only 2 nodes (instead of 3).

Nodes   Reads (keys/s)                  Relocations   Mean RT (a)
        Total     Local     Non-local   (keys/s)      (ms)
1       36k       36k       0.0k        0k            -
2       72k       72k       0.0k        12k           2.4
4       104k      102k      1.6k        27k           6.9
8       121k      118k      2.5k        36k           7.7

(a) Relocation time, see Section 3.2.2
slower than 1 node). ComplEx-Large (Figure 3.9b) accessed the PS less
frequently (11k accessed keys per second) and displayed lower communication
overhead (8 nodes were 1.4x slower than 1 node).
3.3.3 Effect of Dynamic Parameter Allocation
We compared the performance of Lapse to a classic PS approach for matrix fac-
torization (Figure 3.8), KGE (Figure 3.9), and word embeddings (Figure 3.10).
Lapse was 4–203x faster than classic PSs. Lapse outperformed the single
node in all but one task (ComplEx-Small), with speed-ups of 3.1–9x on
8 nodes (over 1 node).
Matrix Factorization
In matrix factorization, Lapse was 90–203x faster than classic PSs and achieved
linear speed-ups over the single node (see Figure 3.8). The reason for this
speed-up is that classic PSs (e.g., PS-Lite) cannot exploit the PAL of the parameter
blocking algorithm. Thus, their run time was dominated by network latency.
Knowledge Graph Embeddings
In knowledge graph embeddings, Lapse was 4–26x faster than a classic PS
(see Figure 3.9). It scaled decently for the large tasks (ComplEx-Large and
RESCAL-Large). In particular for ComplEx-Large, scalability was limited by
localization conflicts on frequently accessed parameters. The probability of a
localization conflict, i.e., that two or more nodes localize the same parameter
at the same time, increases with the number of workers, see Table 3.5. In
the table, the number of localization conflicts is indicated by the number of
non-local parameter reads (which are caused by localization conflicts). For
[Figure 3.10 plots, comparing Classic PS with fast local access (in Lapse) and Lapse at parallelism levels 1x4, 2x4, 4x4, and 8x4 (nodes x threads): (a) epoch run time in hours over parallelism (one bar is marked >24h); (b) error over epochs 6–15; (c) error over run time in hours (all epochs). Panels (b) and (c) mark the best error after 15 epochs.]
Figure 3.10: Performance for training word embeddings. The classic PS ap-
proach did not scale (8 nodes were >4x slower than 1 node). In Lapse, 8 nodes
reached (for example) 39% error 3.9x faster than 1 node. The dashed horizontal
line indicates the best observed error after 15 epochs.
ComplEx-Small, distributed execution in Lapse did not outperform the single
node because of communication overhead.
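The share of non-local reads in Table 3.5, which the text uses as an indicator for localization conflicts, can be computed directly from the reported rates. The following sketch only restates the numbers of the table; the variable and function names are illustrative.

```python
# Read rates from Table 3.5 in thousand keys per second: (total, non-local).
reads = {1: (36.0, 0.0), 2: (72.0, 0.0), 4: (104.0, 1.6), 8: (121.0, 2.5)}

def non_local_share(total, non_local):
    """Fraction of reads that were not served from local memory."""
    return non_local / total

for nodes, (total, non_local) in reads.items():
    share = non_local_share(total, non_local)
    rate_gain = total / reads[1][0]
    print(f"{nodes} nodes: {share:.1%} non-local reads, "
          f"{rate_gain:.1f}x the single-node read rate")
```

The non-local share grows from 0% on 1 and 2 nodes to roughly 1.5% on 4 nodes and 2.1% on 8 nodes, while the read rate grows less than linearly with the number of nodes.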
We additionally measured performance of running Lapse with only
data clustering (i.e., without latency hiding). This approach accesses relation
parameters locally and entity parameters remotely. It improved performance
for RESCAL (Figure 3.9c) more than for ComplEx (Figures 3.9a and 3.9b),
because in RESCAL, relation embeddings have higher dimension (10000 for
RESCAL-Large) than entity embeddings (100), whereas in ComplEx, both
are the same size (100 in ComplEx-Small and 4000 in ComplEx-Large).
Word Embeddings
For word embeddings, Lapse executed an epoch 44x faster than a classic
PS (Figure 3.10). Further, 8 nodes reached, for example, 39% error 3.9x
faster than a single node. The speed-up for word embeddings is lower than
for knowledge graph embeddings, because word embeddings training ex-
hibits strongly skewed access to parameters: few parameters are accessed
frequently (Mikolov et al. 2013). This stronger skew led to more frequent
localization conflicts in the latency hiding approach than in knowledge graph
embeddings, where at least negative samples are sampled uniformly (Ruffinelli
et al. 2020; H. Liu et al. 2017). We will investigate how PSs can be efficient
for such strongly skewed access patterns in the subsequent chapter of this
thesis (see Section 4.2).
3.3.4 Comparison to Manual Management
We compared the performance of Lapse to a highly specialized and tuned
low-level implementation of the parameter blocking approach for matrix
factorization, see Figure 3.11. This low-level implementation cannot be used
for other ML tasks; i.e., it is a competitive baseline specifically for this one
ML task. Both the low-level implementation and Lapse scaled linearly.
Lapse had only 2.0–2.6x generalization overhead over the low-level
implementation. The reason for the overhead is that the low-level implementation
exploits task-specific properties that a PS cannot exploit in general if it aims to
provide PS consistency and isolation guarantees for a wide range of ML tasks.
I.e., the task-specific implementation lets workers work directly on the data
store, without copying data and without concurrency control. This works
for this particular algorithm, because each worker focuses on a separate part
of the model (at a time), but is not applicable in general. In contrast, Lapse
and other PSs copy parameter data out of and back into the server, causing
overhead over the task-specific implementation. Additionally, the low-level
[Figure 3.11 plots: epoch run time in minutes over parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4) for Stale PS (Petuum) with client sync., Stale PS (Petuum) with server sync. (warm-up epoch), Stale PS (Petuum) with server sync., Lapse, and the specialized and tuned low-level implementation. Panels: (a) 10m×1m matrix, 1b entries; (b) 3.4m×3m matrix, 1b entries.]
Figure 3.11: Performance comparison to manual parameter management and
to Petuum, a state-of-the-art replication PS, for matrix factorization. Lapse
and manual parameter management (using a specialized and tuned low-level
implementation) scale linearly, in contrast to Petuum. For the 10m×1m
matrix, Petuum crashed with a network error on 2 nodes.
implementation focuses on and optimizes for communication of blocks of
parameters (which Lapse does not) and does not use a key–value abstraction
for accessing keys.
Implementing the parameter blocking approach was significantly easier
in Lapse than using low-level programming. The low-level implementation
manually moves parameters from node to node, using MPI communication
primitives. This manual parameter allocation required hundreds of lines of MPI
code. In contrast, in Lapse, the same parameter allocation required only 4
lines of additional code.
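To illustrate why explicit relocation fits the parameter blocking approach, the following toy simulation counts non-local accesses with and without relocation. `ToyPS` and `run_epoch` are hypothetical stand-ins written for this sketch; they are not Lapse's API, and the simulation is sequential, whereas a real PS serves concurrent workers.

```python
class ToyPS:
    """Minimal in-memory stand-in for a PS with relocation (hypothetical API)."""

    def __init__(self, num_keys, num_nodes):
        # Static initial allocation: key k lives on node k % num_nodes.
        self.owner = [k % num_nodes for k in range(num_keys)]
        self.remote_accesses = 0

    def localize(self, node, keys):
        """Relocate the given keys to the requesting node."""
        for k in keys:
            self.owner[k] = node

    def access(self, node, key):
        """Count accesses to keys that are not allocated at the accessing node."""
        if self.owner[key] != node:
            self.remote_accesses += 1

def run_epoch(ps, num_nodes, block_size, use_localize):
    blocks = [range(b * block_size, (b + 1) * block_size) for b in range(num_nodes)]
    for subepoch in range(num_nodes):
        # In each subepoch, every node works on a different parameter block.
        for node in range(num_nodes):
            block = blocks[(node + subepoch) % num_nodes]
            if use_localize:
                ps.localize(node, block)
            for key in block:
                ps.access(node, key)
    return ps.remote_accesses

num_nodes, block_size = 4, 100
static = run_epoch(ToyPS(num_nodes * block_size, num_nodes), num_nodes, block_size, False)
dynamic = run_epoch(ToyPS(num_nodes * block_size, num_nodes), num_nodes, block_size, True)
print("remote accesses without localize:", static)
print("remote accesses with localize:   ", dynamic)
```

In this toy setting, a single relocation call per block and subepoch makes all subsequent accesses local, which is the allocation behavior that the low-level implementation has to realize manually with communication primitives.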
3.3.5 Comparison to Replication PSs
We compared Lapse to Petuum, a popular replication PS, for matrix factoriza-
tion (see Figure 3.11).
We found that the replication PS was 2–28x slower
than Lapse and did not scale linearly, in contrast to Lapse.
Petuum provides bounded staleness consistency. As discussed in Sec-
tion 3.1.1, this can support synchronous parameter blocking algorithms, such
as the one we test for matrix factorization. We compared separately to the
two synchronization protocols that Petuum provides: SSP (Ho et al. 2013)
and ESSP (Dai et al. 2015). On the 10m×1m dataset, Petuum crashed on
two nodes with a network error.
SSP
Petuum with the SSP synchronization protocol outperformed the classic PS,
but was 2.5–28x slower than Lapse. The main reason for the overhead was
network latency for synchronizing parameters when their value became too
stale. This approach did not scale because the number of synchronizations
per worker was constant when increasing the number of workers (due to the
increasing number of subepochs).
ESSP
Petuum with the ESSP synchronization protocol outperformed the classic PS,
but was 2–4x slower than Lapse, and only 2.9x faster on 8 nodes than Lapse on
1 node. The reason for this is that after every global clock advance, Petuum’s
ESSP synchronization eagerly synchronizes to a node all parameters that
this node accessed previously. On the one hand, this eliminated the network
latency overhead of SSP synchronization. On the other hand, this caused
significant unnecessary communication, causing the overhead over Lapse
and preventing linear scale-out: in each subepoch, each node accesses only
a subset of all parameter blocks, but Petuum replicates all blocks. Petuum
“learns” which parameters to replicate to which node in a slower warm-up
epoch, depicted separately in Figure 3.11.
3.3.6 Ablation Study
DPA and Fast Local Access
Lapse differs from classic PSs in two ways: (1) DPA and (2) shared memory ac-
cess to local parameters. To investigate the effect of each difference separately,
we compared the run time of three different variants: Classic PS (PS-Lite)
(neither DPA nor shared memory), Classic PS with fast local access (in Lapse)
(no DPA, but shared memory), and Lapse (DPA and shared memory). The
run times of these variants can be compared in Figures 3.8 and 3.9. Without
DPA, shared memory had limited effect, as many parameters were non-local
and access times were thus dominated by network latency (except for the
single-node case, in which all parameters are local). Combining DPA and
shared memory yielded better performance: DPA ensures that parameters
are local and shared memory ensures that access to local parameters is fast.
Location Caching
All figures report run times of Lapse without location caching. We investi-
gated the effect of location caching, which Lapse supports optionally. We
observed similar run times with location caching. For example, for KGE
(Figure 3.9), Lapse was at most 3% faster and at most 2% slower with location
caching than without. The reason for this is that location caches speed up only
remote parameter accesses (see Section 3.2.3). The latency hiding approach
in KGE, however, localizes all parameters before they are used, such that the
vast majority of parameter accesses are local (see Table 3.5). For matrix factor-
ization, location caching had no effect at all, because all parameter accesses
were local (due to the parameter blocking approach). In the Classic PS variant
of Lapse, location caches had no effect because parameters remained at their
home nodes throughout training (as they do in other classic PSs).
3.4 Related Work
We discuss related work on distributed parameter management in Section 2.2
and other related work in Section 2.3. In the following, we further discuss
work that is specifically related to dynamic parameter allocation.
3.4.1 Dynamic Parallelism
FlexPS (Yuzhen Huang et al. 2018) reduces communication overhead by
executing different phases of an ML task with different levels of parallelism
and moving parameters to active nodes. However, it provides no location
control, moves parameters only between phases, and pauses the training
process for the move. This reduces communication overhead for some ML
tasks, but is not applicable to many others, e.g., the tasks that we consider
in this thesis. FlexPS cannot be used for PAL techniques because it does
not provide fine-grained control over the location of parameters. Lapse, in
contrast, is more general: it supports the FlexPS setting, but also provides
fine-grained location control and moves parameters without pausing workers.
3.4.2 Dynamic Allocation in Key–Value Stores
DAL (Németh et al. 2017), a general-purpose key–value store, dynamically al-
locates a data item at the node that accesses it most (after an adaptation period).
In theory, DAL could exploit data clustering and synchronous parameter
blocking PAL techniques, but no others (due to the adaptation period). How-
ever, DAL accesses data items via inter-process communication, such that
access latency is too high to exploit PAL in ML algorithms. Husky (F. Yang
et al. 2016) allows applications to move data items among nodes and provides
fast access to local data items. However, local data items can be accessed by
only one worker (i.e., Husky does not provide global read and write access
to the data items). Thus, Husky can exploit only PAL techniques in which
each parameter is accessed by only one worker, i.e., parameter blocking, but
not data clustering or latency hiding.
3.5 Summary
In this chapter, we explored whether and to what extent PSs can exploit lo-
cality, and whether doing so can be beneficial. To this end, we proposed that
the PS adaptively relocates parameters during run time, according to where
the parameters are currently accessed. This allows PSs to support PAL tech-
niques directly. We found that dynamic allocation can reduce communication
overhead of PSs significantly for some ML tasks.
In particular, dynamic allocation allows efficient distributed training
for algorithms that explicitly create locality, i.e., for the data clustering or
parameter blocking PAL techniques. However, for some tasks, no such al-
gorithms exist, either because they have not been developed yet or because
they are infeasible for the task. For such tasks, only the latency hiding PAL
technique is available. However, as seen in Section 3.3, the efficiency of the
latency hiding technique can be limited by localization conflicts, especially
when parameter access frequency is strongly skewed towards a few frequently
accessed parameters. In the experimental study of this chapter, we saw ini-
tial evidence of this limitation, even though we were using relatively small
degrees of parallelism (only 4 worker threads per node). In the upcoming
Chapter 4, we will observe stronger impacts of this limitation (by evaluating
PS efficiency in harder-to-scale conditions, e.g., with more skew and higher
levels of parallelism), will explore support for the latency hiding technique
further, and try to improve efficiency for real-world ML tasks.
We showed that Lapse can be more efficient than a classic PS. However,
Lapse is also more complex to use than a classic PS. In particular, applications
need to manually trigger relocations by adding localize invocations to
application code (see Algorithm 3.1 on page 36 for an example). They need
to decide which parameters to relocate and when to do so (i.e., how large
the relocation offset between the actual parameter access and the initiation of
the relocation should be). For the data clustering and parameter blocking
PAL techniques, both decisions are usually straightforward. In contrast, for
the latency hiding technique, the timing is often not straightforward, as it
directly affects the number of localization conflicts (and, thus, performance).
Relocation should be early enough to finish before the actual access, but
not much earlier than necessary because that would increase the chance of
localization conflicts. For optimal performance, applications might need to
tune these decisions, making Lapse complex to use. In Chapter 5, we discuss
how adaptive PSs can be made easier to use.
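The role of the relocation offset in the latency hiding technique can be sketched as a schedule of events. This is a minimal illustration under simplifying assumptions: the offset is measured in data points rather than in time, `training_schedule` is a hypothetical helper, and no actual relocation or training takes place.

```python
def training_schedule(data_points, offset):
    """Yield ('localize', keys) and ('access', keys) events in program order.

    `data_points` is a list of key lists, one per data point. The keys of data
    point i + offset are localized while data point i is processed.
    """
    # Warm-up: request relocation for the first `offset` data points.
    for keys in data_points[:offset]:
        yield ("localize", keys)
    for i, keys in enumerate(data_points):
        if i + offset < len(data_points):
            yield ("localize", data_points[i + offset])
        yield ("access", keys)

data_points = [[1, 7], [3, 7], [2, 5], [4, 1]]
for event in training_schedule(data_points, offset=2):
    print(event)
```

A larger offset gives relocations more time to finish before the access, but also lengthens the window in which another node may localize the same key (key 7 in the example is requested for two consecutive data points).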
Chapter 4
Handling Diversity: Non-Uniform Parameter Management
In this chapter, we further investigate the efficiency of PSs for tasks without
explicit locality (e.g., through data clustering or parameter blocking). The
latency hiding technique is applicable to such tasks, but its efficiency can be
limited by localization conflicts (see Section 3.3). We observe that a key cause
for limited performance in real-world ML tasks is non-uniform parameter
access. We identify two main sources of non-uniformity: skew and sampling.
First, in a workload that exhibits skew, a (typically small) subset of
parameters is accessed frequently (e.g., up to 100000 times per second), whereas
a large part of the parameters is accessed rarely (e.g., only once every minute)
(Meka et al. 2009; W. Cheng et al. 2016; Gonzalez et al. 2012; Faloutsos et al.
1999; Moreno-Sánchez et al. 2016; Clauset et al. 2009). The main reason for
skew is that real-world datasets often have skewed frequency distributions
(e.g., graphs (W. Cheng et al. 2016; Gonzalez et al. 2012; Faloutsos et al. 1999),
texts (Moreno-Sánchez et al. 2016), and others (Clauset et al. 2009; Meka
et al. 2009)), and many ML models associate specific parameters with specific
data items (e.g., with the tokens in a text document or with the vertices of a
graph (Mikolov et al. 2013; Nickel et al. 2011; Koren et al. 2009)).
Existing PSs are inefficient for managing skew because they employ one
single management technique for all parameters. Using a single technique
limits performance as none of the existing techniques is efficient for all ac-
cess patterns. To overcome this limitation, we introduce multi-technique
parameter management, i.e., to adapt the parameter management technique
to the access pattern of a parameter. The PS provides multiple management
techniques and chooses a suitable technique for each parameter. Our
prototype implementation NuPS integrates parameter relocation (as presented in
Chapter 3) and parameter replication (Ho et al. 2013; Dai et al. 2015).
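The idea of choosing a management technique per parameter can be sketched as follows. The selection rule used here (replicate a fixed number of the most frequently accessed keys, relocate all others) is an assumption made for illustration; this passage does not specify how NuPS makes the choice, and the key names and counts are invented.

```python
def assign_techniques(access_counts, num_replicated):
    """Replicate the `num_replicated` most frequently accessed keys, relocate the rest."""
    by_frequency = sorted(access_counts, key=access_counts.get, reverse=True)
    hot = set(by_frequency[:num_replicated])
    return {key: ("replication" if key in hot else "relocation")
            for key in access_counts}

# Hypothetical per-key access counts from a skewed workload.
access_counts = {"the": 90_000, "of": 70_000, "server": 120, "zebra": 3, "quark": 1}
techniques = assign_techniques(access_counts, num_replicated=2)
for key, technique in techniques.items():
    print(f"{key:>6}: {technique}")
```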
The second source of non-uniformity is sampling: for a subset of pa-
rameter accesses, random sampling (rather than training data) determines
which parameters are read and written (Mikolov et al. 2013; Ruffinelli et al.
2020; H. Liu et al. 2017; Rendle et al. 2009; Bamler and Mandt 2020; Chechik
et al. 2010; Schroff et al. 2015). One common reason for this access pattern is
negative sampling (Mikolov et al. 2013; Bamler and Mandt 2020; Ruffinelli
et al. 2020; Grover and Leskovec 2016), which, for example, is used to reduce
the cost of many-class classification tasks or to mitigate an absence of negative
training data (e.g., in recommender systems with only positive feedback or
in knowledge graphs that contain only positive edges). For example, when
training knowledge graph embeddings, it is common to randomly perturb
either the subject, the relation, or the object of subject–relation–object train-
ing examples to obtain negative training examples (Ruffinelli et al. 2020). For
instance, from the training example "Marie Curie is a scientist", you could
obtain the negative training example "Marie Curie is a butterfly" by randomly
perturbing the object of the training example (scientist).
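A minimal sketch of this perturbation is shown below. For brevity, it perturbs only the subject or the object (not the relation) and samples replacements uniformly; the entity list and the `perturb` helper are illustrative assumptions, and the sketch does not check whether a generated triple happens to be a true fact.

```python
import random

def perturb(triple, entities, rng):
    """Create a negative example by replacing the subject or the object
    of a (subject, relation, object) triple with a uniformly sampled entity."""
    subject, relation, obj = triple
    slot = rng.choice(["subject", "object"])
    original = subject if slot == "subject" else obj
    # Resample until the entity differs, so the result is not the positive triple.
    replacement = original
    while replacement == original:
        replacement = rng.choice(entities)
    if slot == "subject":
        return (replacement, relation, obj)
    return (subject, relation, replacement)

entities = ["Marie Curie", "scientist", "butterfly", "Warsaw"]
positive = ("Marie Curie", "is a", "scientist")
rng = random.Random(42)
negatives = [perturb(positive, entities, rng) for _ in range(3)]
print(negatives)
```

In a PS setting, each sampled entity corresponds to a parameter key, so the random choice (rather than the training data) determines which parameters are accessed.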
Existing PSs are inefficient for sampling because common parameter
management techniques are ill-suited for randomly sampled access. To im-
prove performance, applications can implement specialized sampling schemes
manually, outside the PS (Lerer et al. 2019; D. Zheng et al. 2020b; Ji et al. 2019;
Stergiou et al. 2017), but this limits the efficiency of some schemes, potentially
produces incorrect samples, and causes repeated implementation effort. We
propose to overcome this limitation by integrating sampling schemes directly
into the PS. To do so, we extend the PS API with a sampling primitive that
allows applications to request samples from a specific sampling distribution
(rather than accessing specific parameters directly). A sampling manager trans-
parently chooses one of several sampling schemes to reduce communication
overhead for sampling, according to a conformity level. Conformity levels
provide a controlled trade-off between efficiency and sample quality.
NuPS, our implementation of a non-uniform PS, implements both multi-
technique parameter management and sampling support, see Figure 4.1. In
our experimental evaluation, NuPS outperformed state-of-the-art PSs by up
to one order of magnitude and provided good scalability across multiple tasks.
We begin this chapter by introducing skew and sampling in detail in
Section 4.1. Section 4.2 explores how PSs can efficiently handle skew. Sec-
tion 4.3 explores how PSs can efficiently handle sampling. Finally, Section 4.4
experimentally investigates how non-uniform parameter management affects
PS performance. Section 4.5 concludes the chapter with an interim summary.
[Figure 4.1 diagram: NuPS exposes a Direct Access API, served by the Multi-Technique Parameter Manager (Section 4.2; replication and relocation), and a Sampling Access API, served by the Sampling Manager (Section 4.3; conformity levels and sampling schemes).]
Figure 4.1: NuPS components. NuPS differs from existing PSs in two main
ways: it introduces (i) multi-technique parameter management to handle skew
and (ii) a sampling manager and API to handle sampling.
4.1 Non-Uniform Parameter Access
We study ML tasks that exhibit non-uniform parameter access. We identify
two main sources of non-uniform parameter access: skew (Section 4.1.1) and
sampling (Section 4.1.2).
4.1.1 Skew
A workload exhibits skew non-uniformity when some parts of the model
are accessed (much) more frequently than others. The main reason for this
is that many real-world datasets have skewed frequency distributions (Meka
et al. 2009; W. Cheng et al. 2016; Mohamed et al. 2020; Gonzalez et al. 2012;
Faloutsos et al. 1999; Moreno-Sánchez et al. 2016; Clauset et al. 2009). For
example, heavy skew is common in text corpora, because word frequencies
are skewed (Moreno-Sánchez et al. 2016), and in graph data, because in- and
out-degree distributions are skewed (W. Cheng et al. 2016; Mohamed et al.
2020; Gonzalez et al. 2012; Faloutsos et al. 1999). As many ML models
associate specific parameters with specific data items (e.g., with words in a
text or with the nodes of a graph) (Mikolov et al. 2013; Nickel et al. 2011;
Koren et al. 2009; Grover and Leskovec 2016), access to the parameters is
heavily skewed, too: a small subset of hot spot parameters is accessed frequently,
whereas the majority of parameters is accessed rarely. In the following, we
will refer to the parameters that are not hot spots as long tail parameters.
We have measured the extent of skew for two real-world ML tasks: train-
ing knowledge graph embeddings and training word embeddings. The left
hand sides of Figures 4.2a and 4.2b show the number of reads per parameter
over one epoch of these tasks, respectively. Access is heavily skewed: in the
knowledge graph embeddings task, 18% of 12.9 trillion total reads go to only
(a) Knowledge graph embeddings
(b) Word embeddings
Figure 4.2: Number of accesses per parameter in one epoch. Parameters are
sorted by decreasing total number of accesses. See Section 4.4.1 for details on
tasks and experimental setup.
0.02% of 4.8 billion parameters. In the word embeddings task, 45% of 9
trillion total reads go to 0.17% of 1.9 billion parameters. Details on the tasks
and datasets can be found in Section 4.4.1.
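A skew statistic of this kind (the share of reads that goes to a small fraction of parameters) can be computed from per-parameter access counts as sketched below. The workload in the sketch is synthetic and Zipf-like, which is an assumption for illustration only; it does not reproduce the measurements reported above.

```python
def share_of_reads(access_counts, top_fraction):
    """Share of all reads that go to the most frequently read parameters.

    `top_fraction` is the fraction of parameters (by count) considered hot.
    """
    ranked = sorted(access_counts, reverse=True)
    num_hot = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:num_hot]) / sum(ranked)

# Synthetic Zipf-like workload (an assumption for illustration only):
# the parameter with rank r is read about 1/r as often as the top one.
num_params = 10_000
access_counts = [1_000_000 // rank for rank in range(1, num_params + 1)]

for fraction in (0.001, 0.01, 0.1):
    print(f"top {fraction:.1%} of parameters receive "
          f"{share_of_reads(access_counts, fraction):.0%} of reads")
```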
Note that skew is not always present in distributed training. For example,
there is no skew in convolutional neural networks for image recognition
(LeCun et al. 1989) because model access is dense, i.e., every update step writes to
all parameters.¹,² In contrast, in common neural network models for natural
language processing (Peters et al. 2018; Devlin et al. 2019; Howard and Ruder
2018), access is partially dense, and partially sparse and skewed: access to the
first (embedding) layer and sometimes the last (classification) layer is based
¹ Thus, with dense access, essentially, all parameters are hot spots.
² However, sparsity and skew can be introduced to the training of such dense models by
specific training methods, e.g., by update sparsification (see Section 2.3.1).
on word or token frequency (and thus sparse and skewed), and access to other
layers is dense. The share of parameters with frequency-based access depends
on the model architecture, but can be high, e.g., around 90% in ELMo (Pe-
ters et al. 2018). In this chapter, we investigate skew in shallow models, but
conjecture that a non-uniform PS can also be beneficial for deeper models
with partially skewed access.
4.1.2 Sampling
A workload exhibits sampling non-uniformity when, for a subset of param-
eter accesses, random sampling determines which parameters are read and
written (Mikolov et al. 2013; Ruffinelli et al. 2020; H. Liu et al. 2017; Rendle
et al. 2009; Rawat et al. 2021; Bamler and Mandt 2020; Chechik et al. 2010;
Schroff et al. 2015). I.e., the application randomly draws a parameter key
from an application-specific sampling distribution over (all or a subset of)
parameter keys. It then accesses the drawn parameter for training. We refer
to such access as sampling access. In contrast, in direct access, the training
data determines which parameters are accessed. Sampling access is common
in many-class classification tasks, e.g., extreme classification (Bamler and
Mandt 2020), natural language processing (Mikolov et al. 2013), knowledge
graph embeddings (Ruffinelli et al. 2020; H. Liu et al. 2017), graph repre-
sentations (Grover and Leskovec 2016; Z. Yang et al. 2020), recommender
systems (Rendle et al. 2009), and when triplet loss is used (Chechik et al. 2010;
Schroff et al. 2015).
For example, knowledge graph embeddings and word embeddings train-
ing tasks often use negative sampling to enable efficient training (Mikolov
et al. 2013; Ruffinelli et al. 2020; Rawat et al. 2021). For each (positive) data
point, a set of negative samples is drawn from a distribution. Each negative
sample corresponds to a data item (e.g., a word) or a class. The corresponding
parameters are subsequently accessed for training. For instance, the example
knowledge graph embeddings task draws negative samples from a uniform
distribution over all entities (Ruffinelli et al. 2020; H. Liu et al. 2017). The
right hand side of Figure 4.2a shows the frequency distributions of direct
and sampling accesses separately for this task. In our implementation (based
on (H. Liu et al. 2017), see Section 4.4.1) and with 200 negative samples for
each subject–relation–object triple (100 negative samples for the subject and
another 100 for the object), sampling accesses make up 31% of all accesses.
In the word embeddings task, negative samples correspond to words and
the sampling distribution resembles the word frequencies in the training
data (Mikolov et al. 2013), see Figure 4.2b. In the plot for direct access,
parameters that belong to the output layer of the task’s neural network are visually
distinct from the other parameters. The reason for this is that the task draws
samples only from the output layer, and parameters in the plot are sorted
by total access frequency. In our implementation (based on (Mikolov et al.
2013), see Section 4.4.1) and with 3 negative samples for each word–word pair,
sampling accesses make up 56% of all parameter accesses in this task.
4.2 Multi-Technique Parameter Management
In this section, we analyze the suitability of existing PSs for ML tasks with
skewed parameter access (Section 4.2.1) and argue that existing PSs are ineffi-
cient for managing skew because they employ one single management tech-
nique for all parameters. Based on this analysis, we propose multi-technique
parameter management and discuss NuPS’s implementation (Section 4.2.2).
4.2.1 Analysis of Common Parameter Management Techniques
As discussed in Section 2.2, several different techniques are used in PSs to
manage parameters. In the following, we analyze the suitability of existing
techniques for managing skew. We briefly recap each technique before starting
our analysis.
Classic Parameter Server
A classic PS allocates parameters to servers statically (e.g., via range parti-
tioning of the parameter keys) and uses no replication (Smola and Narayana-
murthy 2010; Ahmed et al. 2012; Mu Li et al. 2014a). Thus precisely one
server holds the current value of a parameter, and this server is used for all
pull and push operations on this parameter.
Analysis: The performance of a classic PS is limited for both hot
spots and long tail parameters.
The reason for this is that every parameter
access uses the network: it incurs network latency for two messages (to and
from the responsible server) and the parameter value is sent over the network
once (from the server to the worker in a pull operation, in the other direction
for a push operation). This network overhead is incurred for all parameters,
i.e., hot spot and long tail ones. For hot spots, the overhead is incurred many
times for a few parameters. In the long tail, the overhead is incurred a few
times for each of many parameters.
Replication Parameter Server
A replication PS replicates parameters and tolerates some amount of staleness
in the replicas (Ho et al. 2013; Yuzhen Huang et al. 2018; J. Jiang et al. 2017;
Dai et al. 2015; Cui et al. 2014). The SSP protocol creates a replica when a
parameter is accessed and uses this replica until the staleness bound is reached
(at which point the replica is terminated). The ESSP protocol also creates a
replica when a parameter is (first) accessed, but then maintains this replica
throughout the entire training task (by repeatedly propagating updates to the
nodes that hold replicas).
Analysis: A replication PS is efficient for hot spots, but its benefit
for the long tail is limited.
Replication reduces network overhead (com-
pared to a classic PS) if a replicated parameter value is used more than once
and multiple updates can be sent to the PS in aggregated form. Replication
further reduces access latency if a parameter value (within the acceptable
staleness bound) is already locally available when a read operation is issued.
Both are typically the case for hot spot parameters, even within relatively
tight staleness bounds (because hot spot parameters are accessed frequently
at each node). In contrast, long tail parameters are accessed infrequently. So
it is unlikely that a long tail parameter is accessed more than once within
reasonable staleness bounds (large staleness bounds commonly deteriorate
model convergence (Ho et al. 2013)). For the same reason, SSP (which creates
replicas on demand) does not reduce access latency for long tail parameters,
because replicas are mostly “cold”. With its eager replica maintenance, ESSP
ensures that replicas are always “warm” (after the first access to a parameter),
but at the cost of significant over-communication: ESSP constantly updates
all replicas, although replicas for long tail parameters are accessed rarely.
Relocation Parameter Server
A relocation PS such as Lapse (see Section 3.2) asynchronously re-allocates
parameters among nodes during run time so that access operations can be
processed locally, without network communication.
Analysis: A relocation PS is efficient for long tail parameters, but
has limited benefit for hot spots.
Relocation eliminates access latency if
there is sufficient time to relocate a parameter between accesses at different
nodes.³ It further reduces network overhead (compared to a classic PS) if a
parameter is accessed more than once between two relocations (which is common,
as most ML tasks at least read and write a parameter): a relocation takes three
messages in Lapse (including the parameter value once, see Section 3.2.2),
whereas each remote access in a classic PS sends two messages (including the
parameter value once). There is typically sufficient time for relocating long
tail parameters between accesses by different nodes, as they are accessed
infrequently. Hot spot parameters, however, are frequently accessed at multiple
nodes concurrently. Thus, there is not sufficient time for relocations between
accesses, such that access latency is not eliminated. Further, a relocation PS
incurs higher network overhead than a classic PS if a parameter is relocated
so frequently that only one operation is processed locally.
³ We assume the general-purpose latency hiding technique here, i.e., that there
is no explicit locality through data clustering or parameter blocking. If there
is explicit locality, parameter relocation is highly efficient, as discussed in
Section 3.1.2.
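The message counts above suggest a simple break-even rule. The following sketch only counts messages, using the numbers stated in the text (two messages per remote access in a classic PS, three per relocation in Lapse). The function names are hypothetical, and the model ignores message sizes and latency.

```python
def classic_messages(num_accesses: int) -> int:
    """Messages for remote accesses in a classic PS (two per access)."""
    return 2 * num_accesses


def relocation_messages(num_relocations: int) -> int:
    """Messages for relocations in Lapse (three per relocation)."""
    return 3 * num_relocations


def relocation_saves_messages(accesses_per_relocation: int) -> bool:
    """True if one relocation is cheaper than serving the accesses remotely."""
    return relocation_messages(1) < classic_messages(accesses_per_relocation)
```

Under this model, a single local access per relocation costs more messages than the classic approach (3 versus 2), whereas two or more local accesses per relocation cost fewer (3 versus at least 4).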
Summary
Individual management techniques are efficient for either hot spot or long
tail parameters (or neither of the two), but none is efficient for both. Conse-
quently, managing all parameters with the same technique limits the perfor-
mance of PSs for ML tasks with skewed parameter access.
4.2.2 Parameter Management in NuPS
From the above discussion, it follows naturally to explore whether combining
multiple management techniques is beneficial for PS performance. The idea
of combining multiple management techniques has been studied in other
distributed data management systems, such as general-purpose distributed
databases (Dowdy and Foster 1982; Wolfson and Jajodia 1992; El Abbadi 1991;
Ciciani et al. 1990), distributed graph processing systems (Low et al. 2012),
and PSs (S. Kim et al. 2019; Q. Zheng et al. 2021). However, existing systems
combine static allocation with replication, and do not consider relocation.
NuPS combines replication and relocation. First, to manage hot spot
parameters efficiently, NuPS integrates a lightweight variant of eager replica-
tion (Dai et al. 2015). NuPS eagerly creates replicas for hot spot keys on all
nodes and provides time-based staleness bounds. Basing the staleness bound
on time rather than clocks alleviates the need for adding “advance the clock”
operations to application code, but potentially complicates the analysis of
convergence properties. We discuss these implications below. Second, to
manage long tail parameters efficiently, NuPS integrates relocation. Like Lapse,
NuPS asynchronously relocates these parameters before they are accessed,
thus guaranteeing per-key sequential consistency for long tail parameters.
The support for multiple management techniques in NuPS enables appli-
cations to pick a technique for each key based on the key’s access pattern:
if the key is accessed frequently, NuPS can replicate the key; if there are
few accesses, NuPS can employ relocation. During training, the choice of
management technique is transparent to the application, i.e., the application
accesses all parameters in the same way, via the push and pull primitives.
Our experimental evaluation shows that the combination of replication and
relocation can be highly beneficial. Integrating other techniques (e.g., highly
tailored ones) may further improve performance, but is beyond the scope of
this thesis. NuPS does not integrate the classic technique as it is dominated
by replication for hot spots and by relocation for the long tail.
Algorithm 4.1 gives an example of how the multi-technique parameter
management of NuPS can be used in the distributed SGD example that we
introduced in Section 2.2. The differences to a Classic PS implementation
are: (i) the application (i.e., the component that interacts with the PS) picks
a management technique for each parameter key (lines 1–2), and (ii) the
application initiates parameter relocation when a batch is prepared (line 8),
as done for Lapse (see Algorithm 3.1 on page 36). NuPS ignores requests
to relocate parameters that are managed by replication to ensure that the
management technique is transparent to the application.
Algorithm 4.1: Distributed asynchronous SGD with NuPS. Differences to the
Classic PS implementation (Algorithm 2.3 on page 17) are highlighted in green.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      t: the ID of this worker thread,
      T: the number of total worker threads,
      𝒯: list of desired management techniques (one for each key)
 1  foreach k ∈ all_keys() do
 2      set_technique(k, 𝒯_k)
 3  for epoch ← 1 to num_epochs do
 4      b = num_batches(D, batch_size, t, T)
        // data loading (pipeline parallel with training, in separate thread(s))
 5      B = []
 6      for i ← 1 to b do
 7          B_i = prepare_batch(i, D, batch_size, epoch, t, T)
 8          localize(keys(B_i))
        // training
 9      for i ← 1 to b do
10          w = pull(keys(B_i))
11          w = compute_update(B_i, w)
12          push(keys(B_i), w)
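Lines 1 and 2 of Algorithm 4.1 leave open how the application obtains the list of techniques. One possible heuristic, sketched below, replicates the most frequently accessed keys and relocates all others. The function name, the string labels, and the idea of ranking by access counts are assumptions made for illustration; the text only states that frequently accessed keys can be replicated and rarely accessed keys relocated.

```python
from collections import Counter


def choose_techniques(access_counts, num_replicated):
    """Map each key to 'replicate' (most frequent keys) or 'relocate' (rest).

    access_counts: dict mapping key -> observed or estimated access count.
    num_replicated: how many of the hottest keys to replicate (assumed knob).
    """
    hot = {k for k, _ in Counter(access_counts).most_common(num_replicated)}
    return {k: ("replicate" if k in hot else "relocate") for k in access_counts}
```

The resulting mapping could then be passed key by key to a call such as set_technique.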
[Figure 4.3 (diagram): nodes 1 to 3, each running workers 1 to n and helper(s),
connected by remote access, replica synchronization, and relocation. Legend
(allocation): allocated at this node; replicated at this node; not locally
available.]
Figure 4.3: Parameter management in NuPS. NuPS replicates hot spots and
relocates long tail parameters. It accesses replicated and current local
parameters via shared memory.
For efficiency, NuPS co-locates workers and servers in one process per
node (as Lapse does), and accesses replicas and locally allocated parameters
via shared memory. Figure 4.3 depicts an overview. To access a key, a worker
checks whether this key is managed by replication or relocation. If the key
is managed by replication, the worker accesses the key via shared memory,
without network communication. If the key is managed by relocation, the
worker checks whether the key is currently allocated locally. If so, it accesses
the key via shared memory. Otherwise, the worker accesses the parameter
remotely, using the message protocol used by Lapse: a request to the node
that knows where the parameter is currently allocated, which then forwards
the request to this node, which in turn processes the request and sends a
response to the worker (as described in Section 3.2.3).
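The access path described above can be summarized as a small decision function. This is a minimal sketch with hypothetical names (Technique, access_path); it only returns which path would be taken and does not model shared memory, latches, or the Lapse message protocol.

```python
from enum import Enum


class Technique(Enum):
    REPLICATION = 1
    RELOCATION = 2


def access_path(key, technique, locally_allocated):
    """Return how a worker reaches `key`: 'shared_memory' or 'remote'.

    technique: dict mapping key -> Technique (chosen by the application).
    locally_allocated: set of relocation-managed keys currently at this node.
    """
    if technique[key] is Technique.REPLICATION:
        return "shared_memory"   # hot spot keys are replicated on all nodes
    if key in locally_allocated:
        return "shared_memory"   # key is currently allocated at this node
    return "remote"              # request is forwarded to the current owner
```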
NuPS is designed to minimize the run time overhead of providing mul-
tiple management techniques. To do so, NuPS integrates the check for the
management technique and the check for local allocation into one latch ac-
quisition (i.e., a lock held for the duration of the API call). Further, NuPS
can be reduced to a single-technique PS with no measurable run time over-
head for providing more than one management technique: If replication is
not used for any key, the replica synchronization background thread exits
immediately, without sending any messages. If relocation is not used for any
key, no messages are sent for relocation.
NuPS bases its staleness bounds on time rather than logical clocks be-
cause this makes the PS easier to use: time-based bounds alleviate the need
for adding “advance the clock” operations to application code and for tim-
ing them appropriately. NuPS synchronizes the replicas periodically, using
sparse all-reduce operations (i.e., only updated parameters are exchanged
(Träff 2010)). The synchronization is run by a background thread and uses
recursive doubling all-reduce (Thakur and Gropp 2003). However, time-
based staleness bounds potentially complicate the analysis of convergence
properties. If only a bounded number of SGD steps can occur within one
synchronization round, bounded staleness holds (as for clock-based staleness
bounds) and the corresponding analysis carries over (Ho et al. 2013). How-
ever, if the number of SGD steps within one synchronization round cannot be
bounded, convergence analyses for asynchronous SGD apply (Lian et al. 2015;
X. Zhang et al. 2018). In our experiments, the effect of time-based bounds
on performance was minimal because we used replication only for a small
number of parameters and synchronized replicas frequently (see Section 4.4.6
and Section 4.4.7).
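To illustrate what one sparse synchronization round achieves, the following sketch simulates it within a single process. It assumes that each node accumulates its unsynchronized updates as per-key deltas and that only keys with non-zero deltas take part in the exchange. All names are hypothetical, and the sketch models neither the network nor the recursive doubling schedule.

```python
def sparse_all_reduce(deltas_per_node):
    """Sum the accumulated deltas of all nodes, touching only updated keys."""
    total = {}
    for deltas in deltas_per_node:
        for key, delta in deltas.items():
            total[key] = total.get(key, 0.0) + delta
    return total


def sync_replicas(replicas_per_node, deltas_per_node):
    """Bring every replica to (last synced value + sum of all deltas).

    Assumes each replica already contains the node's own unsynchronized
    updates, so only the other nodes' deltas are added. Clears the deltas.
    """
    total = sparse_all_reduce(deltas_per_node)
    for replicas, deltas in zip(replicas_per_node, deltas_per_node):
        for key, delta in total.items():
            replicas[key] += delta - deltas.get(key, 0.0)
        deltas.clear()
```

In a time-based scheme, such a round would be triggered periodically by a background thread, independent of how many SGD steps the workers have completed in the meantime.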
In NuPS, the decision between replication and relocation is static. I.e.,
the application chooses one management technique for each parameter at the
beginning of training and then uses this technique throughout training. This
static decision is a first step towards exploring and evaluating multi-technique
parameter management. One limitation of a static decision is that NuPS
cannot support approaches that require switching techniques dynamically
during run time. Based on our findings with NuPS in this chapter, we will
take a further step and explore dynamic technique switching in the subsequent
chapter (Chapter 5).
4.3 Sampling Management
Existing PSs provide no support for sampling. This means that applications
manually sample keys and then access the corresponding parameters via direct
access, which leads to significant communication overhead. To reduce this
overhead, many applications implement a variety of sampling schemes (Lerer
et al. 2019; D. Zheng et al. 2020b; Ji et al. 2019; Stergiou et al. 2017). The key
idea of such sampling schemes is that slightly (or sometimes rather signifi-
cantly) deviating from the ideal of independent sampling from the desired
target distribution might have only little or no effect on model quality, but
can reduce communication overhead substantially (and consequently speed
up model training). The lack of sampling support in PSs forces applications
to implement such schemes in application code, outside the PS. This leads to
repeated implementation effort and potentially produces incorrect samples.
Further, this precludes sampling schemes that require tight integration with
parameter management.
In contrast to existing PSs, NuPS integrates sampling directly into the
PS. In the following, we present the components of this integration. We
first introduce a set of conformity levels that allow for a controlled trade-off
between efficiency and sample quality (Section 4.3.1). We then analyze confor-
mity and communication overhead of sampling schemes that are commonly
used by applications (Section 4.3.2). Based on this analysis, we propose an
API extension that enables sampling in PSs (Section 4.3.3) and discuss how
NuPS implements several sampling schemes within this API (Section 4.3.4).
4.3.1 Sampling Conformity Levels
Let $\pi$ be a target distribution over parameter keys. We assume that $\pi$
is specified by the application and remains fixed throughout run time.⁴ For
example, in the word embeddings training task of Section 4.1.2, the target
distribution $\pi$ roughly corresponds to relative word frequencies (Mikolov
et al. 2013); cf. Figure 4.2b. When training knowledge graph embeddings,
$\pi$ is often a uniform distribution over all entities (Ruffinelli et al.
2020); cf. Figure 4.2a. Denote by $\mathcal{K}$ the set of parameter keys and
by $\pi_k \geq 0$ the target probability for key $k \in \mathcal{K}$, where
$\sum_{k=1}^{|\mathcal{K}|} \pi_k = 1$. Workers repeatedly draw one or more
samples from the target distribution $\pi$. Denote by $X_{qi} \in \mathcal{K}$
a random variable for the $i$-th sample obtained at node $q$.⁵ We write $N_q$
for the number of samples drawn at node $q$ during the complete run time of
some application. Set $\mathcal{X}_q = \{X_{q1}, \dots, X_{qN_q}\}$ and
$\mathcal{X} = \bigcup_q \mathcal{X}_q$.
⁴ This is mainly to facilitate analysis; an application may use multiple
different sampling distributions, each of which can be analyzed separately.
⁵ Depending on the implementation, there can be multiple workers on each node.
We analyze sampling schemes at the node level to simplify exposition.
We propose a hierarchy of four sampling conformity levels to control
the trade-off between sample quality and efficiency. From the top (L1) to
the bottom (L4) of this hierarchy, sample quality decreases, and potential
efficiency increases:
(L1) CONFORM. The sampling scheme produces mutually independent samples
     from the target distribution $\pi$. I.e.,
     $$p(X_{qi} = k \mid \mathcal{S}) = \pi_k$$
     for all $q, i, k$ and $\mathcal{S} \subseteq \mathcal{X} \setminus \{X_{qi}\}$.
(L2) BOUNDED. The samples at each node have dependencies on past samples,
     but these dependencies are limited and samples at different nodes are
     independent. In more detail, given a dependency bound $B \in \mathbb{N}$,
     it holds
     $$p(X_{qi} = k \mid \mathcal{S}_q^B, \mathcal{S}_{-q}) = \pi_k$$
     for all $q, i, k$, where $\mathcal{S}_{-q} \subseteq \mathcal{X} \setminus
     \mathcal{X}_q$ refers to samples at other nodes and $\mathcal{S}_q^B
     \subseteq \{X_{q1}, \dots, X_{q(i-B-1)}\}$ refers to the samples at node
     $q$ except the last $B$ samples. Note that first-order inclusion
     probabilities match the target probabilities, i.e., $p(X_{qi} = k) =
     \pi_k$, even though subsequent samples may be dependent. For example, a
     sampling scheme that internally draws independent samples from $\pi$ but
     uses each sample twice is BOUNDED with $B = 1$.
(L3) LONG-TERM. The mean first-order inclusion probabilities match the
     target probabilities asymptotically at each node, i.e.,
     $$\lim_{N_q \to \infty} \frac{1}{N_q} \sum_{i=1}^{N_q}
       p(X_{qi} = k \mid X_{q1}, \dots, X_{q(i-1)}) = \pi_k \qquad (4.1)$$
     for all $q, k$. Note that this does not imply $p(X_{qi} = k) = \pi_k$.
     Also, arbitrary dependencies between samples within one or across
     multiple nodes are accepted as long as the asymptotic relative
     frequencies of the samples match the target. For example, a sequential
     sampling scheme that selects a random key order for the $|\mathcal{K}|$
     keys and then draws samples in a round-robin fashion satisfies LONG-TERM
     but not BOUNDED: each key is selected equally often in the long run, but
     the knowledge of the first $|\mathcal{K}|$ samples allows one to uniquely
     determine all future samples, so that no dependency bound can be
     established.
(L4) NON-CONFORM. No guarantees about the sampling probabilities or
     independence.
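The two example schemes mentioned in the definitions of L2 and L3 can be written as small generators. This is a minimal sketch with hypothetical function names; it illustrates the sample sequences only and does not involve any parameter access.

```python
import random
from itertools import islice


def reuse_twice(keys, weights, rng):
    """Draw i.i.d. from the target distribution and emit each draw twice.

    Each sample depends on at most its direct predecessor.
    """
    while True:
        k = rng.choices(keys, weights=weights)[0]
        yield k
        yield k


def round_robin(keys, rng):
    """Fix one random key order, then cycle through it forever.

    Every key occurs equally often in the long run, but after the first
    len(keys) samples all future samples are fully determined.
    """
    order = list(keys)
    rng.shuffle(order)
    while True:
        yield from order
```

For instance, `list(islice(round_robin(["a", "b", "c"], random.Random(0)), 6))` repeats the same permutation of the three keys twice.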
The levels are hierarchical in that L1 implies L2, and L2 implies L3. The first
implication follows since we can set $\mathcal{S} = \mathcal{S}_q^B \cup
\mathcal{S}_{-q}$ for any choice of $\mathcal{S}_q^B$ and $\mathcal{S}_{-q}$.
Proof (L2 implies L3). Starting from some offset $1 \leq o \leq B$, fix some
node $q$ and consider the subset $\{X_{q(aB+o)}\}_{a \in \mathbb{N}}$ of every
$B$-th sample on node $q$, starting from the $o$-th sample. Using the
definition of BOUNDED, we obtain
$$\frac{1}{\lfloor (N_q - o)/B \rfloor}
  \sum_{a=1}^{\lfloor (N_q - o)/B \rfloor}
  p(X_{q(aB+o)} = k \mid X_{q1}, \dots, X_{q(aB+o-B)}) = \pi_k$$
for any choice of $N_q$, i.e., the long-term relative frequencies of every
$B$-th sample match if we start at offset $o$. Since this holds for every
offset $o$, we conclude that Eq. (4.1) holds and L2 implies L3.
Note that we defined L3 via Eq. (4.1) rather than a simpler first-order
probability condition such as $p(X_{qi} = k) = \pi_k$, because correct
first-order conditions are not sufficient to ensure that a sampling scheme is
useful in practice. For example, a sampling scheme that internally draws one
independent sample $X$ from $\pi$, and then uses solely this sample throughout
(i.e., $X_{qi} = X$ for all $q, i$) satisfies such a condition, but is clearly
unsuitable in practice.
Table 4.1: Conformity levels of common sampling schemes.
                             L1        L2        L3
                             CONFORM   BOUNDED   LONG-TERM
Independent sampling         ✓         ✓         ✓
Sample reuse                 ×         ✓         ✓
Local sampling               ×         ×         ×
Direct-access repurposing    ×         ×         ×
4.3.2 Analysis of Common Sampling Schemes
ML applications employ a variety of sampling schemes. In the following,
we analyze schemes that are common in distributed training (Lerer et al.
2019; D. Zheng et al. 2020b; Ji et al. 2019; Stergiou et al. 2017; Kochsiek and
Gemulla 2021) with respect to their effect on (i) communication overhead
and (ii) sampling quality, i.e., into which conformity level they fall. Table 4.1
provides an overview of the latter. In this section, we focus on theoretical
analyses; an empirical evaluation follows in Section 4.4.5.
Independent Sampling
Ideally, applications draw i.i.d. samples from the target distribution and
use each sample once. This scheme is CONFORM, but can lead to significant
communication overhead: for each sample, the corresponding parameter
values need to be transferred to the node and, after an update is computed,
updates need to be propagated to other nodes.
Sample Reuse
Sample reuse reduces communication overhead by using each sample multiple
times (Ji et al. 2019; D. Zheng et al. 2020b; Lerer et al. 2019; Broscheit et
al. 2020). For example, knowledge graph embeddings training can use shared
sampling, i.e., reuse negative samples for all positive examples in a
mini-batch (Broscheit et al. 2020). Reusing a sample multiple times avoids the
transfer of parameter values for another, fresh sample: using a sample $U$
times can reduce the communication overhead by a factor of $U$. We refer to
this factor as the use frequency and to a sample reuse scheme that uses each
sample $u$ times as $U{=}u$ sample reuse. Sample reuse does not provide
CONFORM since samples are not independent. However, it can provide BOUNDED.
For example, if each fresh sample is sampled i.i.d. from $\pi$ and then used
exactly $U$ times, then the scheme is BOUNDED for all $B \geq U$. Moreover, in
mini-batch negative sample reuse as in (Ji et al. 2019; Lerer et al. 2019;
Broscheit et al. 2020), BOUNDED also holds. Here samples are reused only
within one mini-batch of gradient descent so that the mini-batch size provides
a bound on the sample dependency.
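The factor-of-U reduction can be made concrete by counting how many fresh samples (and thus parameter transfers) a given number of sampling accesses requires. The sketch below uses a hypothetical function name and assumes that every fresh sample causes exactly one transfer of parameter values.

```python
import math


def fresh_samples_needed(num_sample_accesses: int, use_frequency: int) -> int:
    """Number of fresh i.i.d. draws needed when each draw is used U times."""
    return math.ceil(num_sample_accesses / use_frequency)
```

Under this assumption, 1000 sampling accesses need 1000 fresh samples with independent sampling (U=1), but only 250 with U=4 sample reuse.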
Local Sampling
In many distributed ML architectures (Ho et al. 2013; Dai et al. 2015; Yuzhen
Huang et al. 2018), at each node, a distinct subset of the model parameters
(the local partition) can be accessed without network communication. Local
sampling restricts sampling accesses to this local partition (D. Zheng et al.
2020b). This scheme eliminates network overhead for sampling accesses
entirely. However, local sampling is NON-CONFORM as nodes see only samples
from the local partition. Some implementations re-partition parameters
periodically such that all nodes at least see all samples over time (D. Zheng
et al. 2020b). Careful re-partitioning might satisfy Eq. (4.1) for certain
target distributions; e.g., if $\pi$ is uniform and parameters are allocated
uniformly and at random. In general, however, local sampling cannot provide
LONG-TERM (i.e., local sampling is only NON-CONFORM). For example, consider
any target distribution in which $\pi_k > 1/Q$ for some $k$ (with $Q$ being
the number of nodes). Local sampling cannot satisfy Eq. (4.1) for such a
target since key $k$ is available for sampling at only one node at a time.
This implies that there is at least one node at which the long-term frequency
of $k$ is $\leq 1/Q$.
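The last step of this argument can be spelled out as follows. The symbols $f_{qk}$ and $r_{qk}$ are introduced here for illustration only, and the sketch assumes that each node draws samples at a constant rate over time.

```latex
% f_{qk}: long-term fraction of time during which key k is allocated at node q
% r_{qk}: long-term relative frequency of key k among the samples drawn at node q
\begin{align*}
  \sum_{q=1}^{Q} f_{qk} &\leq 1
    && \text{(key $k$ is allocated at only one node at a time)} \\
  r_{qk} &\leq f_{qk}
    && \text{(node $q$ can sample $k$ only while $k$ is local)} \\
  \min_{q} r_{qk} \;\leq\; \frac{1}{Q} \sum_{q=1}^{Q} r_{qk}
    &\leq \frac{1}{Q} \;<\; \pi_k
    && \text{(the minimum is at most the mean)}
\end{align*}
```

Hence, at the node that attains the minimum, the long-term frequency of $k$ stays below $\pi_k$, so Eq. (4.1) cannot hold at that node.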
Direct-Access Repurposing
Another sampling scheme is to repurpose direct-access parameters, i.e., to
use them as negative samples. For example, DGL-KE (D. Zheng et al. 2020b)
generates some of the samples by repurposing parameters that occur as posi-
tives in other data points of an SGD mini-batch. This requires no additional
communication for sampling accesses, as the values for the direct access
parameters are transferred to the node either way. In this scheme, the relative
frequency of seeing a key in a sample depends on the occurrence frequency
of the key in the training data. As the training data occurrence distribution
can be (and typically is (Broscheit et al. 2020; Mikolov et al. 2013)) different
from the target distribution, this scheme is NON-CONFORM.
4.3.3 A Primitive for Sampling
It is impossible for PSs to integrate these sampling schemes within the push
and pull PS API. The main problem is that sampling is done by application
code: to conduct a sampling access, an ML application draws a sample of keys
and accesses them via pull or push. For instance, this makes it impossible
for the PS to restrict sampling to the local partition. Further, the PS cannot
even distinguish between direct access (for which it cannot leverage sampling
schemes) and sampling access (for which it can leverage sampling schemes).
To overcome these limitations, we propose to extend the PS API with
a sampling primitive that allows applications to access a sample from a tar-
get distribution, under a specific sampling conformity level. The sampling
manager in NuPS transparently chooses a sampling scheme that conforms
with the chosen conformity level and applies the scheme for all sampling
accesses. We propose one operation
    dist = register_distribution(π, L)
to register a specific sampling distribution π under a specific sampling
conformity level L, and a combination of two operations to draw samples:
    handle = prepare_sample(dist, N)
    keys, values = pull_sample(handle[, n_j])
The argument $N$ is the number of desired samples. The prepare_sample
operation is intended to return instantaneously (and run preparatory work in
the background), whereas pull_sample blocks if called synchronously. After
pull_sample returns, the corresponding keys are stored in keys and
corresponding values are copied to values. Applications can call pull_sample
once to obtain all $N$ samples at once or multiple times to obtain the $N$
samples in smaller portions (by passing $n_0, n_1, \dots < N$ to multiple
invocations of pull_sample such that $\sum_j n_j = N$). Such partial pulls
give the PS more flexibility, and, thus, may result in better performance.
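The shape of the proposed API, including partial pulls, can be illustrated with a single-process stand-in. The class below is a hypothetical toy, not the NuPS implementation: it represents the distribution as a list of keys with weights, draws keys eagerly in prepare_sample (a real PS may defer this decision), and ignores the conformity level.

```python
import random


class ToySamplingPS:
    """Single-process stand-in for a PS with a sampling primitive."""

    def __init__(self, values, seed=0):
        self.values = values              # key -> parameter value
        self.rng = random.Random(seed)
        self.dists = []                   # registered distributions
        self.handles = []                 # pending keys per handle

    def register_distribution(self, keys, weights, level):
        self.dists.append((list(keys), list(weights), level))
        return len(self.dists) - 1        # distribution id

    def prepare_sample(self, dist, n):
        # Toy behavior: draw all n keys now; returns an opaque handle.
        keys, weights, _level = self.dists[dist]
        self.handles.append(self.rng.choices(keys, weights=weights, k=n))
        return len(self.handles) - 1

    def pull_sample(self, handle, n=None):
        # Return the next n samples of the handle (all remaining if n is None).
        pending = self.handles[handle]
        n = len(pending) if n is None else n
        keys, self.handles[handle] = pending[:n], pending[n:]
        return keys, [self.values[k] for k in keys]
```

An application could request six samples with one prepare_sample call and then obtain them in two portions, for example `pull_sample(h, 2)` followed by `pull_sample(h)`.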
Algorithm 4.2 illustrates how the sampling primitive could be used in
the distributed SGD example of Section 2.2. The worker registers the desired
sampling distribution (line 1), requests a sample when the batch is prepared
(line 8),⁶ and pulls the sample keys and values during training (line 10). It
then uses the sample keys and values together with the direct access keys and
values (lines 11–13). In the algorithm, we use $\oplus$ to depict
concatenation (of, e.g., two vectors).⁷
This extension provides sufficient flexibility for implementing a wide
range of sampling schemes, as we describe in the following Section 4.3.4.
⁶ The num_samples($B_i$) function determines the number of samples required in
batch $B_i$.
⁷ For example, with two vectors $a = (0\ 1)$ and $b = (2\ 3\ 4)$, we have
$a \oplus b = (0\ 1\ 2\ 3\ 4)$.
Algorithm 4.2: Sampling support in distributed asynchronous SGD. Differences
to the Classic PS implementation (Algorithm 2.3 on page 17) are highlighted in
green. We write ⊕ to depict concatenation (e.g., of two vectors).
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      t: the ID of this worker thread,
      T: the number of total worker threads,
      π: a sampling distribution,
      L: a sampling conformity level
 1  dist = register_distribution(π, L)
 2  for epoch ← 1 to num_epochs do
 3      b = num_batches(D, batch_size, t, T)
        // data loading (pipeline parallel with training, in separate thread(s))
 4      B = []
 5      S = []
 6      for i ← 1 to b do
 7          B_i = prepare_batch(i, D, batch_size, epoch, t, T)
 8          S_i = prepare_sample(dist, num_samples(B_i))
        // training
 9      for i ← 1 to b do
10          keys_samples, w_samples = pull_sample(S_i)
11          w = pull(keys(B_i)) ⊕ w_samples
12          w = compute_update(B_i, w)
13          push(keys(B_i) ⊕ keys_samples, w)
[Figure 4.4 (diagram), contents per scheme:
  Independent sampling (CONFORM)
    prepare_sample:    sample i.i.d. from π and localize (async)
    pull_sample:       pull parameters (remotely if necessary)
  Sample reuse (BOUNDED)
    prepare_sample:    re-localize if necessary (async)
    pull_sample:       pull parameters (remotely if necessary)
    background thread: fill pool: sample i.i.d. from π and localize
  Sample reuse with postponing (LONG-TERM)
    prepare_sample:    re-localize if necessary (async)
    pull_sample:       pull parameters if local, o/w postpone the sample
    background thread: fill pool: sample i.i.d. from π and localize
  Local sampling (NON-CONFORM)
    sample from locally available part of π and pull locally]
Figure 4.4: Sampling scheme implementations in NuPS.
The extension derives its flexibility from three key design choices. First,
the extension transfers sampling from the application to the PS. Second, the
extension provides the PS with a hook for doing preparatory work, such as
pre-fetching parameter values, modifying partitions, or coordinating among
nodes. Third, the extension does not force final decisions (e.g., about the
sampled keys) before pull_sample returns.⁸
⁸ For this reason, the prepare_sample operation returns a handle rather than
the parameter keys directly.
4.3.4 The Sampling Manager in NuPS
The sampling manager is responsible for generating samples and managing
the corresponding parameters. The sampling manager of NuPS currently
supports four sampling schemes behind the sampling API. Figure 4.4 provides
an overview. Schemes implement prepare_sample and pull_sample, and
optionally a background thread. From the four implemented schemes, the
sampling manager picks a scheme that is suitable for the specified conformity
level. We now discuss the schemes in turn.
Independent Sampling (CONFORM)
In this scheme, NuPS samples i.i.d. from the target distribution and localizes
the corresponding parameters in prepare_sample (such that they can be
accessed locally when pull_sample is called). In pull_sample, NuPS accesses
parameters remotely if they have been relocated to another node between the
invocation of prepare_sample and the invocation of pull_sample (this can
happen because other nodes can independently work on the same parameters).
This approach is CONFORM because each worker samples i.i.d. from $\pi$.
Sample Reuse (BOUNDED)
NuPS implements a sample reuse scheme that reuses pools of keys. The
pooling increases the temporal distance between the reused samples and
thereby increases randomness. For a given pool size $G$ and use frequency
$U$, NuPS repeatedly samples $G$ keys i.i.d. from $\pi$ to form a sample pool
and produces samples by traversing the sample pool $U$ times, each time in a
random order. For example, consider $U = 2$ and suppose that the i.i.d. draws
produce keys $k_1$, $k_2$, and $k_3$, respectively. With $G = 1$, we obtain
sample sequence $k_1 k_1 k_2 k_2 k_3 k_3$. With $G = 3$, a sequence such as
$k_1 k_2 k_3 k_2 k_1 k_3$ is possible. The pools are prepared by a separate
background thread. When the background thread generates a new pool, it
localizes the corresponding parameters. NuPS re-localizes the parameters in
prepare_sample if they have been relocated to another node since pool
preparation. In the pull_sample operation, NuPS accesses the parameters
remotely if necessary. This sample reuse scheme is BOUNDED because samples
are drawn i.i.d. from the target distribution $\pi$, inter-sample dependency
is bounded by $U \cdot G$, and $U$ is identical for all samples.
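The pooled traversal can be sketched as a generator. This is an illustration of the sample sequence only, under the assumption that the target distribution is given as keys with weights; the function name is hypothetical, and localization and the background thread are omitted.

```python
import random


def pooled_reuse(keys, weights, pool_size, use_frequency, rng):
    """Yield samples from pools of `pool_size` i.i.d. keys.

    Each pool is traversed `use_frequency` times, each time in a fresh
    random order, before the next pool is drawn.
    """
    while True:
        pool = rng.choices(keys, weights=weights, k=pool_size)
        for _ in range(use_frequency):
            order = pool[:]
            rng.shuffle(order)
            yield from order
```

With pool_size=1 and use_frequency=2, the generator emits each drawn key twice in a row; with pool_size=3, the two traversals of a pool are permutations of the same three draws.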
The background thread determines automatically when to prepare a
new pool. Adding a new pool takes time and (for good performance) the
localization should be finished when pull_sample is called. This time
depends on the ML task, the used hardware, and the system configuration. To
estimate this time, we use a heuristic. Note that while the heuristic may affect
performance, it does not affect correctness. In particular, the background
thread keeps track of the duration of previous pool relocations. If the number
of prepared, but unused samples is less than double of the current estimated
relocation time, the preparation of another pool is triggered.
Sample Reuse With Postponing (LONG-TERM)
NuPS additionally implements sample reuse with sample postponing. This is
identical to the described sample reuse scheme, but adds sample postponing:
if sample $i$ cannot be accessed locally in pull_sample, NuPS re-localizes
the corresponding parameters, postpones sample $i$ for later use, and uses
sample $i+1$ instead. To achieve LONG-TERM, it is crucial that, at some point,
samples are used (and not re-postponed indefinitely).⁹ Thus, NuPS postpones
only within the $N$ samples of one invocation of prepare_sample (in other
words, only within the samples of one handle). I.e., when NuPS finds a
non-local sample (in pull_sample), it moves the sample to the end of the $N$
samples of this handle. NuPS postpones each sample maximally once. When
it reaches samples that it has already postponed once (towards the end of
the $N$ samples), it accesses them remotely if necessary. This implementation
of postponing reduces communication overhead only if the samples of one
handle are pulled in groups smaller than $N$ and there is some time between
these partial pulls for the parameter relocation. Assuming that $N$ is bounded
from above, it provides LONG-TERM. It does not provide BOUNDED because
sampling probabilities depend on the current allocation of a key (i.e., keys
can be postponed to a later sample if they are not local).
⁹ If samples could be re-postponed indefinitely, some samples may never be used
because they are constantly being relocated. In such cases, Eq. (4.1) would
not hold.
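The postponing rule for one handle can be sketched as follows. The function name and the (key, postponed) representation are assumptions made for illustration; the sketch only decides the order in which keys are pulled and leaves out the actual re-localization and remote access.

```python
def pull_with_postponing(samples, is_local, n):
    """Pull up to `n` samples from one handle; return (pulled, remaining).

    samples: list of (key, postponed) pairs in their current order.
    is_local: function telling whether a key can be accessed locally now.
    A non-local key that was not postponed yet moves to the end of the
    handle's samples. A key that was already postponed once is pulled
    even if it is still non-local (i.e., it would be accessed remotely).
    """
    queue, pulled = list(samples), []
    while queue and len(pulled) < n:
        key, postponed = queue.pop(0)
        if is_local(key) or postponed:
            pulled.append(key)
        else:
            queue.append((key, True))   # re-localization would start here
    return pulled, queue
```

Because every sample is postponed at most once, each call terminates, and all samples of the handle are eventually used.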
Local Sampling (NON-CONFORM)
NuPS implements local sampling without active re-partitioning. Instead,
NuPS relies on the application to relocate parameters: in a relocation PS, the
local partition usually changes constantly, as workers relocate the parameters
that they work with (in direct access). The effect of this local sampling
variant heavily depends on the relocations of the application. Generally,
this approach cannot give any guarantees, as, for example, an application
might not relocate parameters at all. Consequently, it generally falls into the
NON-CONFORM level. In an ideal setting, however, this approach could provide
LONG-TERM. For example, this can be the case if an application partitions its
training data randomly and continuously relocates all parameters (such that
a parameter is equally likely to be on all nodes) and samples uniformly (such
that $\pi_k \leq 1/Q$ for all $k$). To make local sampling efficient, NuPS
employs a fast sampling implementation that does not sample independently.
4.4 Experiments
We conducted an experimental study to investigate whether and to what
extent non-uniformity is beneficial for PS performance. Our source code,
datasets, and information on reproducing experiments are available online.¹⁰
¹⁰ https://github.com/alexrenz/NuPS/tree/sigmod22
In this study, we compared the performance of NuPS to several
state-of-the-art PSs on three large-scale ML tasks (Section 4.4.2). Further, we
conducted an ablation study (Section 4.4.3), investigated the scalability of
different approaches (Section 4.4.4), evaluated different sampling schemes
(Section 4.4.5), and explored specific components of NuPS (Sections 4.4.6
and 4.4.7). Our major insights are: (i) NuPS was more than an order of
magnitude faster than existing PSs. (ii) NuPS achieved best performance
when it replicated a small fraction of the model parameters, and relocated
all other parameters. (iii) Both sample reuse and local sampling significantly
reduced communication overhead for sampling access. We conclude that
a non-uniform PS is key for high performance in ML tasks with non-uniform
parameter access.
Table 4.2: ML tasks, models, and datasets.
                              Model parameters                                  Data
Task                          Model                       Keys    Values  Size     Dataset          Data points  Size
Knowledge graph embeddings    ComplEx, dim. 500           4.8M    4.8B    35.9GB   Wikidata5M       21M          317MB
Word embeddings               Word2Vec, dim. 1000         1.9M    1.9B    7.0GB    1b word benchm.  375M         3GB
Matrix factorization          Latent factors, rank 1000   11.0M   11B     82.0GB   10m×1m matrix    1000M        31GB
4.4.1 Experimental Setup
Tasks
We considered three popular ML tasks that require long training: knowledge
graph embeddings (KGE), word embeddings (WE), and matrix factorization
(MF). The tasks differ in multiple ways, including the number of parame-
ters, parameter access distributions, sampling distribution, and frequency of
sampling accesses. Table 4.2 provides a summary of the models and datasets.
Table 4.3 depicts the share of direct and sampling access for each task. These
tasks are similar to the ones used in Section 3.3, but with some important
differences. The tasks in this section are “harder” to distribute efficiently than
the ones in Section 3.3 because they contain less explicit locality, more skew,
and we use faster compute hardware and higher degrees of parallelism (see
a detailed discussion at the end of this section). In the following, we briefly
discuss each of the three tasks.
Knowledge graph embeddings
This KGE task, based on (H. Liu et al. 2017),
trains ComplEx (Trouillon et al. 2016) (one of the most popular KGE
models) embeddings using SGD with AdaGrad (Duchi et al. 2011) and
negative sampling (Ruffinelli et al. 2020; H. Liu et al. 2017). Negative
sampling creates sampling access in this task: to generate negative samples,
both the subject and the object entity of a positive triple are perturbed
$n_{\mathrm{neg}}$ times, by drawing random entities from a uniform distribution
over all entities (we used a common setting of $n_{\mathrm{neg}} = 100$
(Ruffinelli et al. 2020)). We used the Wikidata5M dataset (X. Wang et al.
2019), a real-world knowledge graph with 4818679 entities and 828 relations,
and a common embedding size of 500 (Ruffinelli et al. 2020). We partitioned
the subject–relation–object triples of the dataset to the nodes randomly,
as done in (Kochsiek and Gemulla 2021). We used LibKGE (Broscheit
et al. 2020) (commit 3146885) to evaluate models and report the mean
reciprocal rank (filtered) (MRRF) as metric for model quality.
Table 4.3: Share of direct and sampling access for each ML task.
                            Parameter access
Task                        Direct  Sampling
Knowledge graph embeddings  69%     31%
Word embeddings             44%     56%
Matrix factorization        100%    0%
Word embeddings
This WE task, based on (Mikolov et al. 2013), uses SGD
and negative sampling to train the skip-gram Word2Vec (Mikolov et
al. 2013) model (dimension 1000) on the One Billion Word Benchmark
(Chelba et al. 2013) dataset (with stop words of the Gensim (Řehůřek
and Sojka 2010) stop word list removed). The negative sampling
creates sampling accesses: in this task, for each word pair, 3 negative sam-
ples are drawn from a distribution that is based on word frequencies (see
Section 4.1.2). We used common model parameters (Mikolov et al. 2013)
for window size (5), minimum count (1), and frequent word subsampling
(0.01). We measured model accuracy using a common reasoning task of
19544 semantic and syntactic questions (Mikolov et al. 2013).
Matrix factorization
This MF task, based on (Teflioudi et al. 2012), uses
SGD to factorize a synthetic, zipf-1.1 distributed 10m × 1m dataset
with 1b revealed cells, modeled after the Netflix Prize dataset.[11] We
generated this dataset to model real-world datasets more closely than the
uniformly distributed datasets that we adapted from (Makari et al. 2015)
for our study in the previous chapter (see Section 3.3). Data points were
partitioned to nodes by row and to workers within a node by column.
Each worker visited its data points by column (to create locality in
column parameter accesses), with random order of columns and of data
points within a column. There is no sampling access in this task. We
report the root mean squared error (RMSE) on the test set as metric for
model quality.
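The negative sampling step of the KGE task can be illustrated with a minimal sketch. It assumes entities are identified by integer ids in `[0, num_entities)`; the function name `perturb_triple` and the triple layout are hypothetical and not taken from the thesis code. Unlike some KGE pipelines, the sketch does not filter out perturbed triples that happen to be true.

```python
import random


def perturb_triple(triple, num_entities, n_neg, rng):
    """Create negatives for one positive (subject, relation, object) triple.

    Both the subject and the object are perturbed n_neg times by drawing
    entity ids uniformly at random (illustrative sketch, not the thesis code).
    """
    s, r, o = triple
    # Replace the subject with a uniformly drawn entity, n_neg times.
    neg_subjects = [(rng.randrange(num_entities), r, o) for _ in range(n_neg)]
    # Replace the object with a uniformly drawn entity, n_neg times.
    neg_objects = [(s, r, rng.randrange(num_entities)) for _ in range(n_neg)]
    return neg_subjects + neg_objects


rng = random.Random(42)
negatives = perturb_triple((17, 3, 256), num_entities=4818679, n_neg=100, rng=rng)
print(len(negatives))  # 2 * n_neg sampled entity accesses per positive triple
```

Under these assumptions, each positive triple triggers 2 * n_neg accesses to uniformly sampled entity parameters, which is the sampling access that the PS has to serve in this task.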
[11] See https://netflixprize.com/. We use a synthetic dataset because the largest openly
available dataset that we are aware of is only 7.6GB in size.
Baselines
We compared performance to a classic PS, to Petuum (a state-of-the-art repli-
cation PS), to Lapse (the state-of-the-art relocation PS), and to a single node
implementation. As classic PS, we used Lapse with relocation disabled, which,
as seen in Section 3.3.2, provides performance similar to PS-Lite. We ran
both the SSP and ESSP protocols of Petuum (Xing et al. 2015), with different
staleness thresholds. Petuum does not provide KGE or WE implementations.
Thus, we implemented the KGE task described above in Petuum. We used
version 1.1 of Petuum. We did not implement specific sampling schemes in
application code, i.e., applications draw independent samples and access them
via direct access. We used a shared memory implementation with 8 worker
threads as single node baseline.
Implementation and Cluster
We implemented NuPS in C++, using ZeroMQ and Protocol Buffers for
communication, based on PS-Lite (Mu Li et al. 2014a) and Lapse. We used
a local cluster of up to 16 Lenovo ThinkSystem SR630 computers, running
Ubuntu Linux 20.04, connected with 100 Gbit Infiniband. Each node was
equipped with two Intel Xeon Silver 4216 16-core CPUs, 512 GB of main
memory, and one 2 TB D3-S4610 Intel SSD. We compiled code with g++
9.3.0, except for Petuum, which we compiled with g++ 7.5.0, as the compila-
tion with g++ 9.3.0 and g++ 8.4.0 failed. Unless specified otherwise, we used
8 nodes and 8 worker threads per node. In Lapse and NuPS, we additionally
used 1 server and 3 ZeroMQ I/O threads per node. In Petuum, we used 4
communication channels per node. To prevent exploding gradients, we used
gradient norm clipping as suggested in (Pascanu et al. 2013) for replicated
parameters in the WE and MF tasks (clipping updates that exceed the average
norm by more than 2x). In the KGE task, the use of AdaGrad prevented
exploding gradients. For each task, we tuned hyperparameters on the single
node and used the best found hyperparameter setting throughout all systems
and variants.
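The clipping rule mentioned above (clip updates whose norm exceeds the average norm by more than 2x) could look roughly as follows. This is a sketch under assumptions that the text does not specify: the average is taken as a running mean over all previously observed update norms, and an offending update is rescaled to exactly the threshold. The class name `AverageNormClipper` is hypothetical.

```python
import math


class AverageNormClipper:
    """Rescales updates whose L2 norm exceeds `factor` times the running mean norm."""

    def __init__(self, factor=2.0):
        self.factor = factor
        self.mean_norm = 0.0  # running mean of observed (unclipped) update norms
        self.count = 0

    def clip(self, update):
        norm = math.sqrt(sum(x * x for x in update))
        limit = self.factor * self.mean_norm
        # Only clip once a positive average norm has been established.
        if self.count > 0 and self.mean_norm > 0.0 and norm > limit:
            scale = limit / norm
            update = [x * scale for x in update]
        # Assumption: the average tracks raw norms, not clipped ones.
        self.count += 1
        self.mean_norm += (norm - self.mean_norm) / self.count
        return update


clipper = AverageNormClipper(factor=2.0)
print(clipper.clip([3.0, 4.0]))    # first update, passed through unchanged
print(clipper.clip([30.0, 40.0]))  # norm 50 exceeds 2 * 5, so it is rescaled
```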
NuPS
We ran NuPS in two configurations: (i) a generally applicable untuned con-
figuration that requires no task-specific tuning and (ii) a task-specific tuned
configuration. The untuned configuration employs a heuristic to decide the
management technique for each parameter: it replicates a parameter if its
access frequency exceeds 100 times the mean access frequency. This heuristic
is computed from dataset frequency statistics. The untuned configuration
further employs sample reuse without postponing (BOUNDED) with a use
frequency of U=16. To indicate the performance potential of task-specific
insights, we included a tuned configuration by informing our configuration
choices with the results of our detailed experiments in Sections 4.4.5 and 4.4.6.
The tuned configuration for KGE replicates the 900 most frequently accessed
keys (the same as the untuned setting), but uses local sampling (NON-CONFORM).
The tuned configuration for WE replicates the 209k most frequently accessed
keys (64x more keys than the untuned configuration), and employs local
sampling (NON-CONFORM). For MF, the untuned configuration seemed to be
near-optimal, such that we did not add a separate tuned configuration. Unless
mentioned otherwise, we used the settings of the untuned configuration and
a replica staleness threshold of 40ms in all experiments. Throughout all
experiments, we used a pool size of 250 in the sample reuse scheme.
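The untuned heuristic described above (replicate a parameter if its access frequency exceeds 100 times the mean access frequency) can be sketched in a few lines. The function name `choose_replicated_keys` and the dictionary input are hypothetical; the sketch also assumes that the mean is taken over all keys in the frequency statistics.

```python
def choose_replicated_keys(access_counts, factor=100.0):
    """Return the keys whose access count exceeds `factor` times the mean count.

    access_counts maps each parameter key to its access frequency, e.g.,
    derived from dataset frequency statistics. All other keys would be
    managed by relocation.
    """
    if not access_counts:
        return set()
    mean = sum(access_counts.values()) / len(access_counts)
    return {key for key, count in access_counts.items() if count > factor * mean}


# Toy statistics: 999 long tail keys and one hot spot key.
counts = {f"key{i}": 1 for i in range(999)}
counts["hot"] = 10000
print(choose_replicated_keys(counts))  # only the hot spot key is replicated
```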
Measures
Unless noted otherwise, we ran all variants with a fixed 6h time budget. We
measured model quality over time and over epochs within this time budget
(using the quality metrics described above). As in Section 3.3, we conducted 3
independent runs of each experiment, each starting from a distinct randomly
initialized model, and report the mean. We depict error bars for model quality
and run time; they present the minimum and maximum measurements. In
some experiments, error bars are not clearly visible because of small variance.
Gray dotted lines indicate the performance of the single node baseline. Gray
shading indicates performance that is dominated by the single node baseline.
We report two types of speedups: (i) raw speedup depicts the speedup in epoch
run time, without considering model quality; (ii) effective speedup is calculated
from the time that each variant took to reach 90% of the best model quality
that the single node baseline achieved. Unless specified otherwise, we report
effective speedups.
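The two speedup measures can be made concrete with a small sketch. It assumes that each run is available as a list of `(time, quality)` measurements sorted by time, and that higher quality is better (as for MRRF and accuracy). How the 90% threshold is applied to an error metric such as RMSE is not covered here. All function names and the toy numbers are hypothetical.

```python
def raw_speedup(baseline_epoch_time, variant_epoch_time):
    """Speedup in epoch run time, ignoring model quality."""
    return baseline_epoch_time / variant_epoch_time


def time_to_quality(curve, threshold):
    """First time at which a (time, quality) curve reaches the threshold, or None."""
    for time, quality in curve:
        if quality >= threshold:
            return time
    return None


def effective_speedup(baseline_curve, variant_curve, fraction=0.9):
    """Speedup in time to reach `fraction` of the best baseline quality."""
    best = max(quality for _, quality in baseline_curve)
    threshold = fraction * best
    t_variant = time_to_quality(variant_curve, threshold)
    if t_variant is None:
        return None  # the variant never reached the threshold within its budget
    return time_to_quality(baseline_curve, threshold) / t_variant


# Toy curves (hours, quality); not measurements from the thesis.
single_node = [(1.0, 0.5), (2.0, 0.8), (4.0, 0.95), (6.0, 1.0)]
distributed = [(0.5, 0.6), (1.0, 0.92)]
print(effective_speedup(single_node, distributed))
```

A variant that never reaches the threshold has no effective speedup in this sketch, which matches the situation in which only raw speedups can be reported.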
Differences to the Experiments in Chapter 3
Overall, the experimental setup of this study seems similar to the setup in
Section 3.3. However, there is a series of important differences that make
the tasks in this section harder to scale. First, we used simpler (more general-
purpose) algorithms in this study, with (much) less explicit locality. In par-
ticular, the KGE and WE tasks rely exclusively on latency hiding. In the
MF task, no parameter blocking approach is used. Instead, the data points
were partitioned by rows and latency hiding is used otherwise. In contrast,
in Section 3.3, we evaluated whether PSs could support algorithms that
deliberately use PAL techniques to create locality (e.g., the DSGD (Gemulla et al.
2011) parameter blocking approach for MF). Simpler algorithms require less
work from the application developer: e.g., there is no need to develop a
blocking approach (MF) or to partition the dataset (KGE).
Second, we examined performance for higher degrees of parallelism in
this study. We doubled both the number of worker threads per node and the
maximum number of nodes. A higher number of workers leads to a higher
chance of localization conflicts (as already observed in Section 3.3).
Third, we used more recent cluster hardware. In particular, the CPUs
of these machines were much faster (around 2x faster for our tasks). This
made single node baselines significantly faster and thus made it harder to
“hide” communication overhead. (On the other hand, the new cluster is also
equipped with faster network hardware.)
Fourth, the datasets in this study are more skewed. In MF, we used a
dataset that more realistically models real-world datasets, which means it is
skewed. And in KGE, the larger Wikidata5M dataset also happens to be more
skewed than the smaller DBpedia-500k dataset in the previous study.
4.4.2 Overall Performance
We investigated the overall effect of a non-uniform PS on PS performance. To
do so, we compared the performance of NuPS to existing PSs and to the single
node baseline. We ran each variant for the fixed time budget and measured
model quality over this time. Figures 4.5a, 4.5c, and 4.5e show model quality
over time, Figures 4.5b, 4.5d, and 4.5f show model quality over epoch. In
summary, NuPS was 31–36x faster than a state-of-the-art replication PS
(Petuum), 6–46x faster than Lapse (the state-of-the-art relocation PS),
and 2.3–10.3x faster than the single node baseline.[12]
Classic PS and Lapse
The classic PS was inefficient (with epochs over 7x slower than the single
node) because it accesses parameters over the network, which induced sig-
nificant access latency. Lapse was faster than Classic, but still slower than
the single node, because Lapse relocates all parameters, including hot spots.
Hot spot parameters, however, are frequently accessed by multiple nodes
simultaneously, such that some of these nodes had to wait for the relocation
to finish or access the parameter remotely, which induced access latency. The
key reasons for Lapse performing worse in this study than in our previous
study in Section 3.3 are (i) that we used simpler algorithms that exhibit less
locality,[13] (ii) that we used more worker threads, (iii) that the compute of the
used cluster is faster (i.e., each thread is faster individually), and (iv) that the
tasks contain more skew than in our previous study.
[12] The comparisons to Petuum and Lapse report raw speedups, because Petuum and Lapse
did not reach the 90% thresholds within the time budget.
Figure 4.5: End-to-end performance of different PSs on 8 nodes. NuPS
outperformed Petuum and Lapse by up to one order of magnitude. The gray
shaded area indicates performance that is dominated by the single node
baseline. The dashed gray line depicts the model quality threshold at which
effective speedups are computed.
Legend: Single node, Classic, Petuum (SSP), Petuum (ESSP), Lapse, NuPS (untuned), NuPS.
Panels: (a) KGE (quality over time) and (b) KGE (quality over epoch), y-axis: MRR (filtered);
(c) WE (quality over time) and (d) WE (quality over epoch), y-axis: accuracy;
(e) MF (quality over time) and (f) MF (quality over epoch), y-axis: RMSE on test data.
x-axes: run time (hours) or epoch.
Annotations: (a) 6.7x speedup (6.6x effective), 6.9x speedup (10.3x effective), arrow labels 7h, 13h, 15h, 38h;
(c) Petuum: no impl., arrow labels 16h, 10h, 2.2x speedup (2.3x effective), 6.7x speedup (6.4x effective);
(e) 6.5x speedup (3.8x effective), Petuum: out of memory.
The per-epoch model quality of Classic and Lapse was indistinguishable
from the single node in KGE and WE, as these systems provide sequential
consistency for all parameters and employ no specialized sampling schemes.
In MF, all distributed variants provided lower per-epoch quality than the
single node, an effect that has been observed before (Makari et al. 2015).
The step pattern that is visible in MF training stems from the bold driver
heuristic (Battiti 1989). The MF implementation (Makari et al. 2015) that we
adapted for our experiments uses this heuristic to tune the learning rate.
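For readers unfamiliar with the bold driver heuristic, the following minimal sketch shows the general idea: after each epoch, the learning rate grows slightly if the loss decreased and is cut sharply otherwise. The concrete factors (5% increase, 50% decrease) are illustrative assumptions, not values confirmed by the text, and the function name is hypothetical.

```python
def bold_driver(learning_rate, previous_loss, current_loss,
                increase=1.05, decrease=0.5):
    """Return the learning rate for the next epoch (bold driver heuristic).

    The factors are illustrative defaults, not the settings of the
    MF implementation used in the experiments.
    """
    if current_loss < previous_loss:
        return learning_rate * increase  # loss improved: grow the step size slightly
    return learning_rate * decrease      # loss did not improve: cut the step size


lr = 0.1
losses = [2.0, 1.5, 1.6, 1.2]  # toy per-epoch training losses
for previous, current in zip(losses, losses[1:]):
    lr = bold_driver(lr, previous, current)
    print(round(lr, 5))
```

Because the learning rate changes only at epoch boundaries, abrupt per-epoch adjustments of this kind are consistent with a step pattern in the quality curves.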
Petuum SSP and ESSP
For KGE, we ran Petuum SSP and ESSP with staleness thresholds 1, 10,
100, 200, or 1000, and tried different frequencies for advancing the clock.[14]
None of the configurations completed the first epoch within the time budget
of 6 hours. We observed the best performance for ESSP with staleness 10,
which finished the first epoch after 13h with a model quality (MRRF) of
0.11. The best SSP run (staleness 200) finished the first epoch after 15h with
a model quality of 0.10. The reasons for this performance are that Petuum
is inefficient for long tail parameters (as discussed in Section 4.2.1) and that
Petuum’s replica approach is inefficient for sampling because sampling access
provides no locality: SSP replicas are mostly cold, ESSP over-communicates.
Petuum’s MF implementation ran out of memory, because it stores the
training matrix in dense format.
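To illustrate why replicas tend to be cold under sampling access, the sketch below models the bounded staleness read condition of SSP in its textbook form: a cached replica may serve a read only if it reflects updates up to at least the reader's clock minus the staleness threshold. This is a simplified model, not Petuum's API; the class `SspReplica` and the `fetch` callback are hypothetical.

```python
class SspReplica:
    """Cached copy of one parameter under a bounded staleness condition."""

    def __init__(self, staleness):
        self.staleness = staleness
        self.value = None
        self.fetched_at_clock = None  # clock that the cached value reflects

    def is_fresh(self, worker_clock):
        if self.fetched_at_clock is None:
            return False  # never fetched: the replica is cold
        return self.fetched_at_clock >= worker_clock - self.staleness

    def read(self, worker_clock, fetch):
        """Return the value, refreshing synchronously if the replica is too stale.

        `fetch` is a hypothetical callable that returns (value, clock) from the server.
        """
        if not self.is_fresh(worker_clock):
            self.value, self.fetched_at_clock = fetch()
        return self.value


server = {"clock": 0, "value": 1.0}
replica = SspReplica(staleness=1)
print(replica.read(0, lambda: (server["value"], server["clock"])))  # cold: fetches
print(replica.is_fresh(1), replica.is_fresh(2))  # fresh at clock 1, stale at clock 2
```

In this model, a key that is accessed again only after the staleness window has passed requires a synchronous refresh on every access, which is the case that sampling access (with little locality) tends to produce.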
NuPS
The untuned NuPS configuration outperformed existing PSs across all three
tasks. For KGE and MF, it was also clearly faster than the single node, with
up to 6.7x effective speedups over the single node and minimal negative effect
on (per-epoch) model quality. For WE, however, it barely outperformed
the single node (but still outperformed existing PSs). In contrast, the tuned
configuration provided 4.6–10.3x effective speedups over the single node
across all three tasks. For KGE, the tuned configuration of NuPS provided
better per-epoch convergence than the single node. This was an effect of local
sampling; see Section 4.4.5 for more details.
[13] Such simpler algorithms are harder to scale for the PS, but require less additional effort
from application developers.
[14] We tried to advance the clock after every 1, every 10, and every 100 data points. We
observed best performance for clocking after every 10th data point. Due to the high run
times of Petuum, we ran each configuration only once.
Figure 4.6: Ablation. Both (i) combining replication and relocation and (ii)
integrating specialized sampling access management techniques improved
performance individually, and it was beneficial to combine the two.
Legend: Relocation (Lapse), Relocation + Replication, Relocation + Sampling, NuPS (untuned).
Panels: (a) KGE, y-axis: MRR (filtered); (b) WE, y-axis: accuracy. x-axis: run time (hours).
4.4.3 Ablation
NuPS introduces two novel features compared to existing PSs: (i) multi-
technique parameter management and (ii) direct sampling support. To inves-
tigate individual effects, we enabled each feature individually and measured
model quality within the time budget. Figure 4.6 shows the results. We omit
MF because it contains no sampling access, such that the entire performance
improvement stems from multi-technique parameter management (which
is visible in Figure 4.5e). We found that both multi-technique parameter
management and sampling integration can be beneficial individually,
and the individual benefits compounded when both were combined.
We compared the performance of four variants: (i) Lapse, a relocation
PS without sampling integration; (ii) Relocation + Replication, a PS with
multi-technique parameter management but without sampling integration;
(iii) Relocation + Sampling, a relocation-only PS with sampling integration;
(iv) NuPS, a multi-technique PS with sampling integration. Going from a
single-technique relocation PS to a multi-technique PS made an epoch 67–73%
faster with only small effect on model quality. Adding sampling support to
the relocation PS made an epoch 17–62% faster, with a small negative effect
on model quality. The combination of both made an epoch 94% faster, with
a small negative effect on per-epoch model quality.
Figure 4.7: Strong scaling (logarithmic axes). The y-axis depicts raw speedup,
i.e., speedup with respect to epoch run time over the shared-memory single
node baseline. NuPS scaled more efficiently than other PSs, with up to
near-linear speedups over the single node baseline.
Legend: Lapse, NuPS (untuned), NuPS.
Panels: (a) KGE, (b) WE (Petuum: no impl.), (c) MF (Petuum: out of memory).
x-axis: number of nodes (1, 2, 4, 8, 16); y-axis: raw speedup.
4.4.4 Scalability
To investigate scalability, we ran Lapse, the best Petuum SSP and ESSP config-
urations, and NuPS for one epoch on 1, 2, 4, 8, and 16 nodes and calculated
the raw speedup. Figure 4.7 depicts the results. Further, we ran convergence
experiments on 16 nodes for those systems that reached the 90% model qual-
ity threshold on 8 nodes. Figure 4.8 depicts the effective speedup for these
systems. Overall, NuPS scaled more efficiently than other PSs, with up
to near-linear raw and up to superlinear effective speedups.
We first discuss raw scalability, i.e., the speedup with respect to epoch
run time (Figure 4.7). On a single node, NuPS and Lapse were faster than
Petuum because NuPS and Lapse access local parameters via shared memory,
whereas Petuum sends intra-process messages to do so. Lapse provided poor
scalability because (with the latency hiding technique) the more nodes are
used, the higher the chance that multiple nodes access a parameter at the
same time and, thus, that they have to wait for a relocation to finish or to
access parameters remotely. Neither Petuum ESSP nor SSP outperformed
the shared-memory single node baseline, even on 16 nodes. ESSP scaled
poorly even when compared to its own (inefficient) run time on a single
node (4.8x faster on 16 nodes) because its eager replication protocol over-
communicates: after a short warm-up period, each node holds a replica of
the full model. The more nodes, the more replicas had to be synchronized,
such that synchronization became a bottleneck. The lazy SSP protocol scaled
better than ESSP compared to its own (inefficient) single node run time (12x
faster on 16 nodes), but its overall performance was poor because its replicas
were cold most of the time (and thus required synchronous replica refreshes).
Figure 4.8: Effective scalability (logarithmic axes). The y-axis depicts effective
speedup, i.e., speedup with respect to reaching 90% of the best model quality
observed on a single node.
Panels: (a) KGE, (b) WE, (c) MF. x-axis: number of nodes (1, 2, 4, 8, 16); y-axis: effective speedup.
NuPS scaled more efficiently than existing PSs because it (i) limits the
bottleneck of eager replication by replicating only a small subset of hot spot
parameters, (ii) prevents the majority of relocation conflicts by employing
relocation only for long tail parameters, and (iii) employs sampling schemes
to reduce sampling communication overhead. With 16 nodes, it provided up
to 13.4x raw speedups over the shared memory single node. NuPS further
provided up to 20x effective speedups for KGE and 8x for MF (see Figure 4.8).
For WE, although the raw speedup on 16 nodes was 10.2x, the effective
speedup was only 2.2x. The reason for this is that we used the hyperparameter
configuration that worked best on the single node throughout all experiments.
With other hyperparameters, we observed better effective speedups for WE.
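As a small worked example, the speedups reported above for 16 nodes can be related to the node count. The sketch computes the speedup per node (often called parallel efficiency); the helper name is hypothetical, and the input numbers are the speedups quoted in the text.

```python
def parallel_efficiency(speedup, num_nodes):
    """Speedup divided by node count; 1.0 is linear, above 1.0 is superlinear."""
    return speedup / num_nodes


reported = [
    ("best raw speedup", 13.4),
    ("KGE effective speedup", 20.0),
    ("MF effective speedup", 8.0),
    ("WE raw speedup", 10.2),
    ("WE effective speedup", 2.2),
]
for label, speedup in reported:
    print(f"{label}: {parallel_efficiency(speedup, 16):.2f}")
```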
4.4.5 Effect of Sampling Schemes
We investigated the effect of different sampling schemes in NuPS on run
time and model quality. To do so, we ran KGE and WE with different
sampling schemes: independent sampling (CONFORM), U=16 and U=64 sample
reuse without postponing (BOUNDED) and with postponing (LONG-TERM), and
local sampling (NON-CONFORM). Figures 4.9a and 4.9c show model quality
over time, Figures 4.9b and 4.9d show model quality over epoch. We omit
MF as it does not contain sampling access. We further omit the results
from sample reuse with postponing as its results were within 10% of sample
reuse without postponing.[15] We found that both sample reuse and local
sampling led to significant speedups over independent sampling, with
small negative or (in the case of local sampling) even positive effects on
per-epoch model quality.
Figure 4.9: Performance of different sampling access management techniques.
Both sample reuse and local sampling led to significant speedups over
independent sampling.
Legend: Single node; Independent sampling (CONFORM); Sample reuse, U=16 (BOUNDED);
Sample reuse, U=64 (BOUNDED); Local sampling (NON-CONFORM).
Panels: (a) KGE (quality over time) and (b) KGE (quality over epoch), y-axis: MRR (filtered);
(c) WE (quality over time) and (d) WE (quality over epoch), y-axis: accuracy.
x-axes: run time (hours) or epoch. Annotation in (b): local sampling with static allocation.
Independent Sampling and Sample Reuse
Independent sampling provided per-epoch quality near-identical to the single
node, but was slowest, because it induced high communication overhead for
each sample. Sample reuse had lower communication overhead (and, thus,
faster epoch run times), but at the cost of a (small) negative effect on per-
epoch model quality. The higher the use frequency, the faster an epoch and
the larger the negative effect on quality. The U=16 variant provided a good
compromise, with minimal effect on model quality and fast run times.
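The trade-off between use frequency and communication can be illustrated with a simplified pool-based reuse scheme. The sketch below is one plausible reading and not the NuPS implementation: it assumes that a pool of independently drawn keys is created, that each key in the pool is served U times in shuffled order, and that a fresh pool is drawn once the current one is exhausted. The exact scheme is defined earlier in the thesis and may differ. The class `SampleReusePool` and its parameters are hypothetical names.

```python
import random


class SampleReusePool:
    """Serves each independently drawn sample `use_frequency` times (sketch)."""

    def __init__(self, draw_key, pool_size=250, use_frequency=16, rng=None):
        self.draw_key = draw_key            # callable that draws one key independently
        self.pool_size = pool_size
        self.use_frequency = use_frequency
        self.rng = rng or random.Random()
        self.pending = []                   # remaining uses of the current pool
        self.fresh_draws = 0                # number of independent draws so far

    def _refill(self):
        keys = [self.draw_key() for _ in range(self.pool_size)]
        self.fresh_draws += self.pool_size
        # Each drawn key is used use_frequency times, in random order.
        self.pending = keys * self.use_frequency
        self.rng.shuffle(self.pending)

    def next_sample(self):
        if not self.pending:
            self._refill()
        return self.pending.pop()


key_rng = random.Random(0)
pool = SampleReusePool(lambda: key_rng.randrange(1000), rng=random.Random(1))
samples = [pool.next_sample() for _ in range(4000)]
print(len(samples), pool.fresh_draws)  # 4000 samples served from 250 independent draws
```

In this simplified model, only one in U served samples is a fresh draw, so the parameters behind a pool need to be made available once per U uses. A larger U therefore reduces communication but makes consecutive samples more dependent.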
Local Sampling
Local sampling exhibited excellent performance, despite providing no guaran-
tees on sampling quality: it was fast and per-epoch model quality was as good
as the single node in WE, and was better in KGE. We hypothesize that this
mainly is because NuPS combines local sampling with dynamic allocation:
both tasks continuously relocate model parameters, such that the local param-
eter partitions contain many different parameters over time. To evaluate this
hypothesis, we ran local sampling with static allocation in KGE. Figure 4.9b
includes the results: with static allocation, model quality deteriorated drasti-
cally. We further conjecture that the reason for the better-than-single-node
quality of local sampling in KGE was that relocation led to local samples
that were more informative than global samples. Similar effects have been
observed previously (D. Zheng et al. 2020b).
4.4.6 Choice of Management Technique
We investigated how the choice of management technique, i.e., the choice of
whether to replicate or relocate a key, affects the performance of NuPS. The
NuPS untuned heuristic replicates the 900 most frequent keys in KGE, the
3272 most frequent keys in WE, and the 755 most frequent column keys in
MF. We varied these numbers by factors 1/64, 1/16, 1/4, 4, 16, 64, and 256. The
leftmost columns of Table 4.4 depict what share of keys was replicated for
each setting. We ran one epoch of each setting and measured epoch run time
and model quality. Figure 4.10 depicts the results. We found that it was
crucial for performance to replicate “enough” parameters such that the
set of hot spot parameters is managed by replication, but not too many
parameters, as replication created significant over-communication for
long tail parameters.
[15] Postponing made no measurable difference in KGE, and sped up WE run times by 10%,
with no measurable impact on model quality.
Figure 4.10: Impact of the management technique on epoch run time and
model quality. The numbers in the plots depict model quality. A run is marked
red if the resulting model quality was not within 10% of the model quality
without replication. For these runs, the numbers in the plots additionally
depict the actual synchronization frequency.
Panels: (a) KGE, (b) WE, (c) MF. x-axis: epoch run time (minutes); y-axis: number of
replicated keys. The 1x setting is the NuPS heuristic.
Replicated keys per factor (0, 1/64x, 1/16x, 1/4x, 1x, 4x, 16x, 64x, 256x):
KGE: 0, 14, 56, 225, 900, 3600, 14400, 57600, 230400
WE: 0, 51, 204, 818, 3272, 13088, 52352, 209408, 837632
MF: 0, 12, 47, 189, 755, 3020, 12080, 48320, 193280
Model quality in the same order:
KGE (MRRF): 0.094, 0.095, 0.095, 0.097, 0.095, 0.093, 0.075 (2.56 syncs/s), 0.045 (0.63 syncs/s), 0.003 (0.14 syncs/s)
WE (accuracy): 46.9, 46.6, 46.8, 46.4, 45.3, 45.3, 45.3, 45.0, 40.0 (0.05 syncs/s)
MF (RMSE): 1.24, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.19
Arrow labels: (a) 367 min, 87 min; (c) 10 min.
This effect was visible for all tasks: starting from no replicated keys (i.e.,
all keys managed by relocation), increasing the number of replicated keys
first improved run time, and had minimal effect on model quality. However,
after some point, replicating more keys deteriorated model quality, and even
slowed down run time for KGE and MF. The reason for the negative effect
on model quality was that the replicas were stale, because the replica updates
became too large to synchronize them frequently over the network of the
cluster. We configured NuPS to provide the default 40ms staleness bound
(i.e., 25 synchronizations per second), but to not block operations when
it did not reach this goal. Figure 4.10 includes the actual synchronization
frequency if model quality was not within 10% of the model quality without
replication. The middle columns of Table 4.4 provide the size of the replicated
values for all settings. For example, the 64x WE setting replicated 799MB
run times for KGE and MF, because relocation operations competed with
replica synchronization for network bandwidth. This effect was not visible
for WE because, in WE, the majority of accesses went to replicated keys (and,
thus, were fast despite network congestion). The share of accesses that went
to replicas is depicted in the rightmost columns of Table 4.4. For example,
88% of all accesses went to replicas in the 64x WE setting.
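The numbers in this paragraph can be approximately reproduced with simple arithmetic. The sketch below assumes 1000 values per WE key (Table 4.2 lists 1.9B values for 1.9M keys), 4 bytes per value, and sizes reported in binary megabytes. These are inferences from the tables rather than statements from the text, and the helper names are hypothetical.

```python
def replica_size_mb(num_keys, values_per_key, bytes_per_value):
    """Size of the replicated values in binary megabytes (2**20 bytes)."""
    return num_keys * values_per_key * bytes_per_value / 2**20


def syncs_per_second(staleness_ms):
    """Synchronization frequency needed to meet a staleness bound."""
    return 1000.0 / staleness_ms


# 64x WE setting: 64 * 3272 = 209408 replicated keys.
print(round(replica_size_mb(64 * 3272, 1000, 4)))  # close to the 799MB in Table 4.4
print(syncs_per_second(40))                         # 25 synchronizations per second
```

Under these assumptions, meeting the 40ms bound in the 64x WE setting would mean exchanging replica updates for roughly 800MB of values 25 times per second, which gives an intuition for why the actual synchronization frequency dropped for large replication factors.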
4.4.7 Effect of Replica Staleness
We investigated the effect of replica staleness on epoch run time and model
quality. To do so, we varied synchronization frequency: we synchronized
replicas either 125, 25, 5, 1, or 0.2 times per second or not at all. We ran
one epoch of each setting and measured epoch run time and model quality
after this epoch. Note that without replica synchronization, nodes may hold
different models. In these cases, we evaluated the model of the first node.
Figure 4.11 reports the results. Overall, replication had only a minimal
effect on model quality when replica staleness was low.
Replication had only a small effect on model quality when replicas were
synchronized at least 5 times per second. In contrast, infrequent synchroniza-
tion (less than once per second) deteriorated model quality drastically in KGE
and WE. However, infrequent synchronization (or no synchronization at
all) worked well in some settings (in particular in MF). We speculate that the
reason for this was that NuPS employs replication for only a small subset of
parameters, such that replicated parameters are kept synchronized indirectly
through the parameters that are managed by relocation.
Figure 4.11: Effect of replica staleness on epoch run time and model quality.
The numbers in the bars depict model quality after one epoch. Bars are marked
red if model quality was not within 90% of the quality produced by a setting
with no replication.
Panels: (a) KGE, (b) WE, (c) MF. x-axis: epoch run time (minutes); y-axis: synchronization
frequency (syncs per second). 25 syncs per second is the NuPS default.
Model quality per synchronization frequency (0, 0.2, 1, 5, 25, 125 syncs per second):
KGE (MRRF): 0.000, 0.018, 0.065, 0.089, 0.095, 0.096
WE (accuracy): 43.1, 13.5, 45.4, 45.4, 45.3, 45.4
MF (RMSE): 1.25, 1.29, 1.25, 1.25, 1.25, 1.24
Table 4.4: Share of replicated keys, replica size, and share of accesses to replicas
for different extents of replication. A cell is marked red if the resulting model
quality was not within 10% of the quality without replication.
                Replicated keys (%)        Size of replicated values (MB)  Accesses to replicas (%)
Factor          KGE     WE       MF        KGE   WE    MF                  KGE  WE  MF
0               0.0000  0.0000   0.0000    0     0     0                   0    0   0
1/64x           0.0003  0.0027   0.0001    0     0     0                   23   7   3
1/16x           0.0012  0.0108   0.0004    0     1     0                   33   13  5
1/4x            0.0047  0.0435   0.0017    2     3     1                   38   25  9
1x (heuristic)  0.0187  0.1740   0.0069    7     12    6                   41   45  14
4x              0.0747  0.6958   0.0275    27    50    23                  44   67  19
16x             0.2988  2.7832   0.1098    110   200   92                  45   82  24
64x             1.1951  11.1330  0.4393    439   799   369                 47   88  30
256x            4.7806  44.5319  1.7571    1758  3195  1475                52   92  37
4.4.8 Comparison to Task-Specific Implementations
In a general-purpose system, a performance overhead over optimized task-
specific implementations is expected. To investigate the extent of this over-
head in NuPS, we compared to specific implementations for each task. Each
of these implementations is specialized and highly tuned for the respective
task. In contrast to a general-purpose PS such as NuPS, these implemen-
tations cannot be used to run other ML tasks. Note that some of these
implementations use different, more complex training algorithms than the
implementations in NuPS. Overall, we found that NuPS was competitive
with specialized and tuned task-specific implementations.
Matrix Factorization
For MF, we compared to the highly tuned MPI implementations of DSGD
and DSGD++ (Teflioudi et al. 2012). We ran convergence experiments on 8
and 16 nodes. We measured how long the implementations took to reach the
90% quality threshold. We used the same hyperparameters, model starting
points, and learning rate schedule across DSGD, DSGD++, and NuPS. On 8
nodes, NuPS was 16% faster than DSGD and 15% slower than DSGD++. On
16 nodes, NuPS was 37% faster than DSGD and 16% faster than DSGD++.
Knowledge Graph Embeddings
For KGE, we compared to the highly specialized PyTorch-BigGraph frame-
work (Lerer et al. 2019). Note that PyTorch-BigGraph is designed for a
different training algorithm, with different hyperparameters: to reduce
communication overhead, it uses mini-batch SGD, whereas the KGE implementation
in NuPS employs regular SGD (i.e., batch size 1).[16] To minimize the
impact of algorithm hyperparameters in our comparison, we compared epoch
run times. NuPS ran an epoch in 12 minutes on 16 nodes (24 minutes on 8
nodes). In this setting (i.e., batch size 1), PyTorch-BigGraph was much slower
than NuPS: it took more than 5 hours to run one epoch, both on 8 and 16
nodes. Using a very large batch size led to faster epochs in PyTorch-BigGraph
(up to 3x faster with batch size 1000 than NuPS with batch size 1), but large
batch sizes can also be implemented in NuPS.
Word Embeddings
For WE, we are not aware of a highly tuned and publicly available distributed
implementation, so we compared to two highly tuned single node implemen-
tations: the original C implementation of Word2Vec (Mikolov et al. 2013)
and Gensim (Řehůřek and Sojka 2010). The implementation in Gensim
and the one in NuPS are both based on the original C implementation. For
both single node implementations, we achieved the fastest epoch run times
with 64 threads. Gensim completed an epoch in 15 minutes, the original
implementation in 12 minutes. With 8x8 threads, NuPS took 13.5 minutes
for one epoch; with 16x8 threads, it took 8 minutes. One factor that limits
the performance of NuPS compared to these task-specific implementations
is that—as other general-purpose PSs (Mu Li et al. 2014a)—NuPS provides
per-key atomic updates. To achieve this, workers receive dedicated working
copies of parameters. Creating these copies and writing updates back into
the parameter store creates overhead compared to the task-specific WE imple-
mentations, which let workers read and write in the parameter store directly,
without any consistency or isolation guarantees. Empirically, this works well
for this particular task, but the effects for other tasks in a general-purpose
system are unclear.
[16] This batch size stems from the C++ KGE implementation that we adopted for our
experiments (H. Liu et al. 2017).
Chapter 4. Handling Diversity: Non-Uniform Parameter Management
4.5 Summary
In this chapter, we explored how PSs can be efficient for tasks that exhibit
non-uniform parameter access. We discussed two major improvements for
PS efficiency in such tasks. First, we found that PSs can be more efficient if
they adapt their parameter management techniques to the access patterns of
individual parameters. Second, we found that PSs can be more efficient if
they account for different types of parameter access, i.e., direct and sampling.
Sampling support allows NuPS to transparently use suitable sampling schemes
to reduce communication overhead for sampling access.
An important limitation of the multi-technique parameter management
described in this chapter is that the technique for each parameter is picked
(i) by the application and (ii) statically, before the start of training. Picking
suitable techniques in this way requires either domain knowledge or hyper-
parameter tuning. The heuristic that we used in our experimental evaluation
provided decent results for some, but not all, ML tasks. In addition, similar
to Lapse, NuPS requires applications to manually initiate parameter reloca-
tion, and to potentially tune the timing of these initiations. In the upcoming
Chapter 5, we will describe how adaptive PSs can be efficient for many ML
tasks out of the box, without prior tuning.
Chapter 5
Attaining Ease of Use:
Automatic Adaptivity
In the previous chapters, we found that adapting individual aspects of the
PS to the underlying ML task can improve PS efficiency for ML tasks with
sparse parameter access. For example, Lapse dynamically adapts parameter
allocation (see Chapter 3). NuPS (see Chapter 4) adapts by combining dif-
ferent parameter management techniques (e.g., replication and relocation)
and picking a suitable one for each parameter; this allows NuPS to manage
parameters with different access patterns efficiently. Further, replication
PSs, such as Petuum (Ho et al. 2013; Dai et al. 2015), adaptively replicate
parameters on specific nodes when the nodes access these parameters.
However, to be efficient, these approaches require the application to
choose the right approach and to specify suitable performance hyperparame-
ters. These choices depend on the ML task, the workload, and even individ-
ual parameters. Making efficient choices often requires domain knowledge
and/or expensive upfront experimentation. These requirements make ex-
isting approaches complex to use for applications. For example, in Lapse,
applications need to modify their application code to initiate parameter relo-
cations and tune the timing of these relocations. In multi-technique PSs such
as NuPS, applications need to specify upfront which technique to use for
which parameter, and—for optimal performance—need to tune these choices
(even if the access frequency distribution can be computed relatively easily,
see Section 4.4.6). In Petuum, applications need to tune a staleness threshold
specifically for each ML task.
In this chapter, we explore whether PSs can adapt to the underlying ML
task automatically, i.e., without any prior tuning. To enable such automatic
adaptation, we propose intent signaling, a novel mechanism for passing infor-
mation about parameter access from the application to the PS. Intent signaling
decouples information from action: the application signals which parameters
it intends to access before it does so; the PS adapts automatically based on these
signals. Intent signaling is easier to use than existing approaches because it
requires no upfront information, requires no tuning, and integrates naturally
with the construction of training batches in common ML systems. Intent
signaling not only allows for implementing most existing approaches, but
also enables more precise and more efficient parameter management.
Intent signaling opens a large design space for adaptive parameter man-
agement. We explore the main aspects of this space and describe AdaPS, a
novel PS that automatically (i.e., without user input) and adaptively (i.e.,
based on the current situation) decides what to do and when to do it. AdaPS
is easy to use as it requires no information besides intent signals and no knob
tuning. Behind the scenes, AdaPS dynamically picks a management technique
that is currently suitable for the parameter: if, at one point in time t, only
one node accesses the parameter, AdaPS relocates the parameter to this node;
otherwise, it replicates the parameter to precisely the nodes that have active
intent at t. Furthermore, AdaPS learns automatically when to act on an
intent signal, so that applications do not need to tune when they signal intent.
In our experimental evaluation, AdaPS was efficient across multiple large
ML tasks without requiring any tuning. It matched or even outperformed
existing (more complex to use) approaches out of the box.
This chapter is structured as follows. We begin by analyzing efficiency
and complexity of existing PSs (Section 5.1). We then propose intent signaling
(Section 5.2), describe AdaPS (Section 5.3), and investigate the performance
of AdaPS in an experimental study (Section 5.4).
5.1 Efficiency and Complexity of Existing Approaches
In this section, we briefly recap existing approaches for distributed parameter
management and analyze them with respect to ease of use and efficiency for
ML tasks with sparse parameter access. For adaptive approaches, we addi-
tionally analyze in which dimensions they are adaptive. Figure 5.1 illustrates
existing approaches. Table 5.1 summarizes our analyses.
5.1.1 Static Full Replication
Static full replication (Figure 5.1a) replicates the full model to all cluster nodes
statically (i.e., throughout training) and synchronizes the replicas periodically,
either synchronously (triggered by the application) or asynchronously in the
background. As this approach is entirely static, it requires no run time infor-
mation from the application. So it is relatively easy to use. (It does, however,
require the application to either trigger replica synchronization or to set the
frequency of background synchronization.) Parameter access is fast, as every
worker can access every parameter locally, without synchronous network
communication. However, the full replication approach is communication-
inefficient for sparse workloads (as visible in Section 4.4.6, for example), as it
maintains the replicas of all parameters on all nodes throughout the training
task, even though each node accesses only a small subset of these replicas at
each point in time. Also, full replication limits model size to the memory
capacity of a single node.
In summary, full replication is very easy to use,
but inefficient for sparse workloads because it over-communicates.
5.1.2 Classic PS
A classic PS (Figure 5.1b) statically partitions the model parameters to the
cluster nodes and processes reads and writes by transparently sending mes-
sages to the corresponding nodes. A Classic PS is easy to use: it requires no
information from the application and no hyperparameter tuning. However,
the classic PS approach is inefficient because the vast majority of parameter
accesses involve synchronous network communication for sending messages
to the node that holds the parameter (see Sections 3.3.2 and 4.4.2).
In sum-
mary, a classic PS is very easy to use, but inefficient due to synchronous
network communication.
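To illustrate why most accesses in a classic PS are remote, the following minimal sketch assumes modulo partitioning of integer keys and uniformly random parameter access without any locality. The names `home_node` and `remote_fraction` are illustrative and not part of any PS API.

```python
import random


def home_node(key: int, num_nodes: int) -> int:
    """Static partitioning (assumed here: modulo on integer keys)."""
    return key % num_nodes


def remote_fraction(accesses, node: int, num_nodes: int) -> float:
    """Fraction of accesses issued at `node` whose key lives on another node."""
    remote = sum(1 for k in accesses if home_node(k, num_nodes) != node)
    return remote / len(accesses)


rng = random.Random(0)  # fixed seed so that the run is reproducible
num_nodes = 8
accesses = [rng.randrange(1_000_000) for _ in range(100_000)]
frac = remote_fraction(accesses, node=0, num_nodes=num_nodes)
expected = (num_nodes - 1) / num_nodes
print(f"remote accesses at node 0: {frac:.3f} (expected about {expected:.3f})")
```

Under these assumptions, roughly (n-1)/n of all accesses on an n-node cluster require a message to another node, which is why the synchronous network round trip dominates.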
5.1.3 Replication PS
A replication PS, such as Petuum (Xing et al. 2015), partitions parameters
as a Classic PS does. During training, it adaptively replicates a subset of the
parameters to nodes that access these parameters. Petuum sets up replicas
reactively when a worker on a node accesses the parameter. Thus, the workers
have to wait for replicas to be set up synchronously.
The SSP protocol (Figure 5.1c) maintains a replica for an application-
specified number of logical clocks, the so-called staleness bound. Applications
have to tune this staleness bound specifically for each task, as the staleness
bound impacts both model quality and run time efficiency and these effects
differ from task to task (Ho et al. 2013). This tuning makes SSP complex to
use. Further, SSP is inefficient for many tasks because, for realistic staleness
bounds, no replicas are set up for the majority of parameter accesses, so that
workers have to wait for synchronous replica setup frequently.
In summary,
an SSP replication PS is complex to use because it requires tuning and
inefficient because of synchronous replica setup.
[Figure 5.1: seven panels, each showing the parameters held by node 1 and node 2 initially, early, and late during training: (a) Static full replication, (b) Classic, (c) SSP replication, (d) ESSP replication, (e) Relocation, (f) Multi-Technique, (g) AdaPS.]
Figure 5.1: Parameters held by different nodes at different times (initially; and
early and late during training) in common parameter management approaches.
One square depicts one parameter. A node either (i) cannot access the parameter
locally, (ii) holds the main copy of the parameter, or (iii) holds a
replica of the parameter.
Table 5.1: Approaches to distributed parameter management: adaptivity, ease
of use, and efficiency for sparse workloads. Some approaches adaptively set up
and destruct replicas, and some adaptively change the main location of parameters.
However, existing approaches are mostly static with respect to choice of
management technique, and require the application to time adaptation.

                                      |------------------- Adaptivity --------------------|
Approach                              Replication       Parameter  Choice of  Timing          Ease of  Efficiency
                                                        location   technique                  use
Full replication                      static (full)     static     single     none            ++       --
Classic PS (PS-Lite)                  none              static     single     none            ++       --
Replication PS (SSP)                  adaptive          static     single     by application  -        -
Replication PS (ESSP)                 adaptive          static     single     by application  -        --
Relocation PS (Lapse)                 none              adaptive   single     by application  --       -
Multi-Technique PS (BiPS, Parallax)   static (partial)  static     static     none            --       +
Multi-Technique PS (NuPS)             static (partial)  adaptive   static     by application  --       +
AdaPS (this thesis)                   adaptive          adaptive   adaptive   adaptive        +        ++
The ESSP protocol (Figure 5.1d) maintains the replica throughout the
entire training task. This mitigates the inefficiency of reactive replica creation
(as each replica is set up only once), at the cost of over-communication. For
many workloads, after a short setup phase, ESSP is essentially equivalent
to full replication. This makes ESSP inefficient for sparse workloads.
In
summary, an ESSP replication PS is complex to use, and inefficient due
to over-communication.
5.1.4 Relocation PS
A relocation PS, such as Lapse (Section 3.2), adaptively changes the location of
parameters during training, such that parameters can then be accessed locally
at the respective nodes, and no replica synchronization is required.[1] Key
for the efficiency of a relocation PS is that parameter relocation is proactive,
i.e., that parameter relocation runs asynchronously and is finished before the
parameter is accessed. The PS lacks the necessary information to trigger
relocations proactively. Thus, Lapse requires the application to trigger parameter
relocations manually via the additional localize primitive. Application
developers are required to add appropriate invocations to their application
code and (for optimal performance) tune the relocation offset, i.e., how long
before the actual access parameter relocation is triggered. Relocation should
be early enough such that it is finished when the parameter is accessed, but
not too early to minimize the probability of relocating the parameter away
from other nodes that are still accessing it. Offloading these performance-
critical decisions to the application makes relocation PSs complex to use and
leads to potentially sub-optimal performance. In addition, relocation PSs are
inefficient for many real-world ML tasks because they are inefficient for hot
spots, i.e., parameters that are frequently accessed by multiple nodes concur-
rently.
In summary, relocation PSs are complex to use, and inefficient
for many real-world ML tasks.
[1] As before, we assume the general-purpose latency hiding technique here, i.e., that there is
no explicit locality through data clustering or parameter blocking. If there is explicit locality,
parameter relocation is highly efficient, as discussed in Section 3.1.2.
5.1.5 Multi-technique PS
Multi-technique PSs, such as NuPS, support multiple parameter management
techniques (e.g., replication, classic, or relocation) and let the application pick
a suitable one for each parameter. Parallax (S. Kim et al. 2019) and BiPS (Q.
Zheng et al. 2021) support static replication (i.e., creating replicas on all nodes)
and classic. NuPS supports static replication and relocation (Section 4.2.2).
The choice of technique in existing multi-technique PSs is static: for each
parameter, the application picks one technique before training, which is then
used throughout. Using a suitable technique for each parameter can improve
PS efficiency. However, it requires information about the workload. As
these PSs decide on a technique for each parameter before training, they
require this information upfront. There are heuristics that pick a technique
for each parameter (e.g., see Section 4.4.1 or (S. Kim et al. 2019)). These
heuristics require access frequency statistics and do not consistently achieve
optimal performance. Thus, manual tuning can be required to achieve high
efficiency. These information and tuning requirements make multi-technique
PSs very complex to use and—if not tuned appropriately—lead to sub-optimal
performance. NuPS additionally requires the application to manually trigger
relocations (as in Lapse), further complicating its use.
In summary, multi-
technique PSs are efficient, but very complex to use.
5.1.6 Summary
Adaptivity is key for efficient distributed parameter management for sparse
ML tasks. However, existing approaches adapt only with respect to a few
dimensions, see Table 5.1. Adaptivity also requires information about the un-
derlying workload. In existing ML systems, this information is not available
to the PS. Thus, current adaptive approaches place key parameter management
decisions (which technique/PS to use, when to relocate, etc.) in the hands
of the application and require the application to tune these performance
knobs. This interdependence makes current adaptive approaches complex
to use: developers need to learn about PSs and their performance knobs,
modify application code, and run training multiple times for tuning. This
interdependence also limits efficiency: applications can make sub-optimal
decisions and neglect the tuning of performance knobs. The interdependence
also hinders the development of more adaptive approaches (e.g., combining
several types of adaptivity, or picking management techniques dynamically
during run time), as more adaptivity would further increase complexity.
5.2 Intent Signaling
To enable easy-to-use adaptive parameter management, we propose intent
signaling, a novel mechanism that naturally integrates into common ML
systems. Intent signaling passes information about upcoming parameter
access from the application to the PS. It decouples information from action,
with a clean API in between: the application provides information (intent
signals); the PS transparently adapts to the workload based on the intent
signals. I.e., all parameter management related decision-making and knob
tuning is done transparently by the PS. The application only signals intent.
An intent is a declaration by one worker that this worker intends to
access a specific set of parameters in a specific time window in the future.
A typical choice for the time window would be one training batch. For
example, a worker could signal "I will access parameters 13 and 16 in batch 5".
We use logical clocks as a general way to specify the start and end points of an
intent. Each worker i has one logical clock C_i that is independent of other
workers’ clocks. (These clocks are used only to define time steps, i.e., no
clock synchronization among workers is necessary.)[2] Each worker advances
its clock with an advanceClock() primitive (as done in Petuum (Xing et
al. 2015), but, in contrast to Petuum, invocation of our advanceClock()
is cheap, as it only raises the clock). For example, a worker could advance
its clock whenever it starts processing a new batch. Key is that intent is
signaled before the parameter is actually accessed, such that the PS has time
to adapt proactively.
Intent signaling integrates naturally with the data loader paradigm of
common ML systems: in common ML systems, there is one component—
usually in one or multiple separate threads—that prepares training batches
before they are processed by the worker thread(s). Examples are the data
loader in PyTorch (Paszke et al. 2019), TensorFlow (Abadi et al. 2016) datasets,
and the Gluon data loader in MXNet (T. Chen et al. 2015). While preparing
the training data for the batch, this component could signal intent for the
batch, before the training thread later accesses the corresponding parameters.
We propose the following primitive for signaling intent:
Intent(parameters, C_start, C_end [, type])
With this primitive, a worker signals that it intends to access a set of parameters
in the time window between a start clock C_start (inclusive) and an end
clock C_end (exclusive). The primitive optionally allows specifying an intent
type, e.g., read, write, or read+write. Figure 5.2 illustrates an example: with
Intent({13, 16}, 5, 6, read+write), a worker signals that it intends to
read and write parameters 13 and 16 while it is at clock 5 (i.e., in batch 5
if the worker advances its clock at the start of each batch). We say that an
intent is inactive if it is signaled, but the worker has not reached the start
clock yet, i.e., C_i < C_start. We say that an intent is active if the worker clock
is within the intent time window, i.e., C_start ≤ C_i < C_end. And we say that an
intent is expired when the worker clock has reached the end clock, i.e., when
C_end ≤ C_i. Invocation of the Intent primitive is meant to be cheap, i.e., it
should not slow down the worker, even if the worker signals many intents.
Workers can flexibly combine intents: they can signal multiple (potentially
overlapping) intents for the same parameter, extend one intent by signaling
another intent later on, etc.
[Figure 5.2: timeline over the worker clock C_i (ticks 0 to 10) marking the intent signal Intent({13,16}, 5, 6, read+write), the period in which the intent is inactive, and the period in which it is active.]
Figure 5.2: Example intent (for parameters 13 and 16).
[2] Applications can, of course, choose to synchronize the clocks of different workers.
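The three intent states defined in this section (inactive, active, expired) can be sketched as a small data type. This is a minimal illustration, not the AdaPS implementation: the class names `Intent` and `IntentState` and the default intent type are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class IntentState(Enum):
    INACTIVE = "inactive"  # signaled, but C_i < C_start
    ACTIVE = "active"      # C_start <= C_i < C_end
    EXPIRED = "expired"    # C_end <= C_i


@dataclass(frozen=True)
class Intent:
    parameters: frozenset
    c_start: int  # start clock, inclusive
    c_end: int    # end clock, exclusive
    type: str = "read+write"  # assumed default, for illustration only

    def state(self, worker_clock: int) -> IntentState:
        """Classify this intent relative to the signaling worker's clock."""
        if worker_clock < self.c_start:
            return IntentState.INACTIVE
        if worker_clock < self.c_end:
            return IntentState.ACTIVE
        return IntentState.EXPIRED


# The example from the text: parameters 13 and 16 during clock 5.
intent = Intent(frozenset({13, 16}), 5, 6)
for clock in (4, 5, 6):
    print(clock, intent.state(clock).value)
```

Because the end clock is exclusive, an intent for a single batch b is written as the window from b to b+1.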
Algorithm 5.1 illustrates how intent signaling can be used in the dis-
tributed SGD example of Section 2.2. The program is similar to the one for a
Classic PS (Algorithm 2.3 on page 17), with two additions: (i) the data loader
signals intent when it prepares the batch (line 7) and (ii) the worker advances
its clock after each batch (line 13).
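The program described above can be sketched in runnable form. The sketch below makes several simplifying assumptions: `ToyPS` is a hypothetical single-process stand-in for a PS (not the AdaPS API), data loading runs sequentially before training instead of pipeline-parallel in separate threads, and the model is a toy problem in which each parameter is a scalar that is fitted to a target value.

```python
from collections import defaultdict


class ToyPS:
    """Hypothetical in-memory stand-in for a PS; only mimics the primitives."""

    def __init__(self):
        self.store = defaultdict(float)  # parameter key -> value
        self.clock = 0                   # logical clock of the (single) worker
        self.intents = []                # recorded (keys, c_start, c_end)

    def intent(self, keys, c_start, c_end):
        self.intents.append((frozenset(keys), c_start, c_end))

    def pull(self, keys):
        return {k: self.store[k] for k in keys}

    def push(self, keys, updates):
        for k in keys:
            self.store[k] += updates[k]  # additive update

    def advance_clock(self):
        self.clock += 1


def train(ps, data, num_epochs, batch_size, lr=0.5):
    c = 0  # batch counter, used as the clock of the next intent
    for _ in range(num_epochs):
        # Data loading: prepare batches and signal intent ahead of training.
        batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
        for batch in batches:
            ps.intent({k for k, _ in batch}, c, c + 1)
            c += 1
        # Training: pull, compute update, push, then advance the clock.
        for batch in batches:
            keys = {k for k, _ in batch}
            w = ps.pull(keys)
            grads = defaultdict(float)
            for k, target in batch:
                grads[k] += 2.0 * (w[k] - target)  # gradient of (w - target)^2
            ps.push(keys, {k: -lr * grads[k] / len(batch) for k in keys})
            ps.advance_clock()


ps = ToyPS()
data = [(13, 1.0), (16, -2.0), (13, 1.0), (7, 0.5)]
train(ps, data, num_epochs=20, batch_size=2)
print(ps.clock, len(ps.intents), dict(ps.store))
```

In this sketch, the intent for the j-th batch covers exactly the clock value that the worker holds while it processes that batch, because the clock starts at 0 and is advanced once after each batch.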
Intent signals allow for precise adaptation, more precise than existing
approaches. For example, a replication PS could use the signals to set up
a replica exactly while intent is active. In contrast to existing approaches,
it would not need to rely on heuristics to decide how long to maintain the
replica. Additionally, as intents are signaled before the actual access, the PS
could set up the replica proactively, before the worker accesses the parameter.
Intent signaling also enables more adaptive approaches. For example, as we
will describe in Section 5.3, intent signals allow AdaPS to choose suitable
management techniques dynamically during run time (rather than statically
using one technique per parameter, as existing approaches do), and to time actions
appropriately (as opposed to applications explicitly triggering relocations,
as done in Lapse and NuPS).
5.3 The AdaPS Parameter Server
Intent signaling opens a large design space for adaptive PSs. Key design
questions include: how to change parameter allocation, when and where to
maintain replicas, when to act on intent signals, how to synchronize replicas,
how to exchange intent signals, on which nodes to make decisions, and how
to communicate among nodes efficiently. We explore this design space and
describe AdaPS. AdaPS is a PS that requires no input beyond intent signals
and no knob tuning. It automatically decides how to act on intent signals
Algorithm 5.1: Distributed asynchronous SGD with intent signaling.
Differences to the Classic PS implementation (Algorithm 2.3
on page 17) are highlighted in green.
Data: D: training dataset,
      num_epochs: number of epochs to run,
      batch_size: batch size,
      t: the ID of this worker thread,
      T: the number of total worker threads
 1  c = 0  // batch counter
 2  for epoch ← 1 to num_epochs do
 3      b = num_batches(D, batch_size, t, T)
        // data loading (pipeline parallel with training, in separate thread(s))
 4      B = []
 5      for i ← 1 to b do
 6          B_i = prepare_batch(i, D, batch_size, epoch, t, T)
 7          intent(keys(B_i), c, c+1)
 8          c = c + 1
        // training
 9      for i ← 1 to b do
10          w = pull(keys(B_i))
11          w = compute_update(B_i, w)
12          push(keys(B_i), w)
13          advanceClock()
[Figure 5.3: architecture diagram. The process at node 1 contains worker 1, worker 2, ..., worker n and a server, connected via shared memory; intents are synchronized with the processes at node 2 and node 3. Parameter legend: no local access, main copy, replica.]
Figure 5.3: AdaPS architecture. For efficiency, AdaPS runs multiple worker
threads in one process per node, and accesses locally available parameters via
shared memory.
and when to do so. We give a brief overview of key design features before
we detail each one in the following subsections. Figure 5.3 illustrates the
architecture of AdaPS.
Automatic Choice of Technique
AdaPS employs relocation and replication, and automatically picks between
the two. This choice can change over time: AdaPS dynamically chooses
the right technique for the current situation. Intuitively, AdaPS relocates a
parameter if—at one point in time—only one node accesses the parameter.
Otherwise, it creates replicas precisely where they are needed. We discuss
AdaPS’s choice of technique in Section 5.3.1.
Automatic Action Timing
AdaPS automatically learns the right time to act on an intent signal.
This ensures that applications do not need to fine-tune the timing of their
intent signals. They can simply signal their intents early, without sacrificing
performance. See Section 5.3.2 for details.
Responsibility Follows Allocation
In AdaPS, the node that currently holds the main copy of a parameter takes on
the main responsibility for managing this parameter: it decides how to act on
intent signals and acts as a hub for replica synchronization. For efficiency, this
responsibility moves with the parameter whenever the parameter is relocated.
We describe details in Section 5.3.3.
[Figure 5.4: decision diagram. Number of nodes with active intent at time t? If 1: relocate to the node with intent. If >1: maintain replicas on the nodes with active intent.]
Figure 5.4: AdaPS decides automatically whether to relocate or replicate a
parameter at any time t.
Efficient Communication
AdaPS communicates for exchanging intent signals, relocating parameters,
and managing replicas. To communicate efficiently, AdaPS locally aggregates
intents and sends aggregated intents over the network, groups messages when
possible to avoid the overhead of small messages, and employs location caches
to improve routing; see Section 5.3.4 for details.
Optional Intent
Intent signals are optional in AdaPS. An application can access any parameter
at any time and any node, without signaling intent. However, signaling
intent potentially makes access more efficient, as it allows AdaPS to avoid
synchronous network communication.
5.3.1 Automatic Choice of Technique
AdaPS receives intent signals from workers. Based on these intent signals,
AdaPS tries to ensure that a parameter can be accessed locally at a node while
this node has active intent for this parameter. To achieve this, AdaPS has to
work out where to ideally allocate a parameter, if and where to create replicas,
and for how long to maintain each replica.
To make parameters available locally, AdaPS employs (i) relocation and
(ii) selective replication. I.e., it (i) can relocate parameters from one node
to another, and (ii) can selectively create replicas on subsets of all nodes for
specific periods of time.[3] If, at one point in time, only one node has active
intent for a parameter, AdaPS relocates the parameter to the node with active
intent. After the node’s intent expires, AdaPS keeps the parameter where
it is until some other node signals intent. In contrast, when multiple nodes
have active intent for one parameter at the same time, AdaPS selectively
creates a replica at each of the nodes when the intent of that node becomes
active. It destructs the replica when the intent of that node expires. Figure 5.4
illustrates AdaPS’s decision between relocation and selective replication.
[3] Selective replication is also used by SSP in Petuum. The main difference is that intent
signals allow AdaPS to set up a replica before it is accessed and maintain it precisely while it is
needed. And, in contrast to AdaPS, Petuum employs only replication.
[Figure 5.5: three timelines over nodes 0 to 3, each marking intent signals, action windows, active intents, parameter allocation, and parameter replicas: (a) Non-overlapping intents, (b) Partially overlapping intents, (c) Many concurrent intents.]
Figure 5.5: Examples for parameter management in AdaPS.
Let us consider three exemplary intent scenarios to understand how
AdaPS manages parameters.
1. Two nodes have intent for the same parameter, and the active phases
   of the intents do not overlap; see Figure 5.5a. AdaPS relocates the
   parameter from its initial allocation (node 0) to the first node with
   intent (node 2). AdaPS keeps the parameter at this node even after the
   intent expires. Before the intent of the second node becomes active,
   AdaPS relocates the parameter to the second node with intent (node 3).
2. Two nodes have intent for the same parameter, and the active periods
   of the intents partially overlap; see Figure 5.5b. AdaPS relocates the
   parameter to the first node with intent, then creates a replica on the
   second node while the two active intents overlap, and relocates the
   parameter to the second node after the intent of the first node expires.
3. Multiple nodes repeatedly have intent for the same parameter; see
   Figure 5.5c. AdaPS creates replicas on all nodes with active intent.
   Whenever there is exactly one node with active intent (and the parameter is
   not currently allocated at this node), AdaPS relocates the parameter to
   this node.
AdaPS combines parameter relocation and replication because they
complement each other well, as described in Section 4.2.2: relocation is
efficient for parameters that are accessed rarely, as in the first example above,
because the parameter value is transferred over the network only once per
access (from where the parameter currently is to where the intent is). In
contrast, replication can significantly reduce network overhead for frequently
accessed parameters. In contrast to previous approaches combining these
two (e.g., NuPS or BiPS (Q. Zheng et al. 2021)), AdaPS employs selective
replication: intent signals allow AdaPS to create a replica precisely for the
time during which a replica is needed. This increases efficiency as AdaPS does
not need to maintain replicas while they are not needed. Further, in contrast
to previous systems, the choice of parameter management techniques for each
parameter is dynamic: depending on the intent signals, AdaPS can relocate
the parameter at one point in time, and replicate it at another.
For making its decisions, AdaPS treats all intent types identically. More
complex approaches could tailor their choices to intent type. For example,
systems could choose to take different actions for “read” and “write” intents.
In AdaPS, we keep the system simple and treat all intent types identically
because we do not expect tailoring to improve performance for typical ML
workloads: (i) applications typically both read and write a parameter and (ii)
synchronous remote reads are so expensive that it is beneficial to provide a
locally accessible value for a parameter even for a single read.
[Figure 5.6: two timelines over nodes 0 to 3: (a) Relocate only when exactly one node has active intent, (b) Relocate immediately when the owner’s intent expires.]
Figure 5.6: AdaPS relocates a parameter only when there is exactly one node
with active intent (and the parameter is currently not allocated at this node).
AdaPS relocates parameters only when there is (at one point in time)
exactly one node with active intent, and this node does not currently hold
the parameter. It does not relocate a parameter while multiple nodes have
active intent, even if the intent of the current owner expires, see Figure 5.6.
AdaPS employs this approach because relocating in the presence of replicas
would require (i) updating routing information on each replica holder (see
Section 5.3.4) and (ii) transferring intent information from the current owner
to another node (see Section 5.3.3).
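The decision rule of this section can be sketched as a small simulation. The sketch is a simplification under stated assumptions: it ignores action timing (it acts exactly when an intent becomes active), it treats relocation and replica setup as instantaneous, and the function names `manage` and `simulate` are illustrative rather than part of AdaPS.

```python
def manage(owner, active_nodes):
    """Return (new_owner, replica_holders) for one parameter at one point in time."""
    active = set(active_nodes)
    if len(active) == 1:
        (node,) = active
        return node, set()  # relocate (a no-op if `node` already is the owner)
    if len(active) > 1:
        # Keep the main copy where it is and replicate to the other active nodes,
        # even if the owner itself no longer has active intent.
        return owner, active - {owner}
    return owner, set()  # no active intent: the parameter stays where it is


def simulate(initial_owner, intents, horizon):
    """intents: {node: (start, end)} with inclusive start and exclusive end."""
    owner, trace = initial_owner, []
    for t in range(horizon):
        active = {n for n, (s, e) in intents.items() if s <= t < e}
        owner, replicas = manage(owner, active)
        trace.append((t, owner, sorted(replicas)))
    return trace


# Partially overlapping intents of nodes 2 and 3 (compare the second scenario).
trace = simulate(0, {2: (2, 6), 3: (4, 8)}, horizon=9)
for step in trace:
    print(step)
```

In this run, the parameter first moves to node 2, node 3 holds a replica while both intents are active, and the parameter moves to node 3 once node 2's intent has expired, where it then stays.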
5.3.2 Automatic Action Timing
AdaPS receives intent signals before the intents become active. I.e., there is
an action window between the time the intent is signaled and the time the
intent becomes active. AdaPS needs to work out at which point in this action
window it should start to act on the intent signal, i.e., when it relocates the
parameter or sets up a replica for this parameter. For example, consider the
intent of node 3 in Figure 5.5b: AdaPS needs to figure out at which point in
time it starts maintaining a replica on node 3.
Relocating a parameter or setting up a replica takes some time. Con-
sequently, if AdaPS acts too late, relocation or replica setup is not finished
in time, such that the parameter is not available locally and, instead, has to
be accessed remotely, slowing down training. On the other hand, if AdaPS
acts too early, it might maintain a replica longer than needed. For example,
imagine AdaPS would set up the replica on node 3 in Figure 5.5b immediately
after the intent is signaled. This would cause AdaPS to over-communicate,
as it would send updates to a replica that are never read. Furthermore, if
AdaPS acts too early, it might use replication in scenarios in which—with
better timing—relocation would have been both possible and more efficient.
For example, consider the scenario in Figure 5.5a: if AdaPS acts on the intent
of node 3 immediately after the intent signal, it would need to maintain a
replica on node 3 while the intent of node 2 is still active. I.e., in that period,
it would send any update of node 2 to node 3 unnecessarily.
However, acting on an intent signal (slightly) too early is much cheaper
than acting too late. The reason for this is that acting too late slows down
training significantly because the worker is forced to access the parameter
remotely. In contrast, acting slightly too early merely causes
over-communication. Thus, it is desirable to err on the side of acting too early.
The key challenge for AdaPS is that both (i) the preparation time, i.e.,
the time it takes to relocate or set up a replica and (ii) the length of the action
window are unknown. The length of the action window is unknown because
AdaPS does not know when the worker will reach the start clock of the intent.
Both times are affected by many factors, e.g., by the application, the compute
hardware, the network hardware, and the utilization of that hardware.
Learning When to Act
AdaPS aims to learn when the right time is to act on an intent signal. A
general approach would estimate both preparation time and the length of
the action window separately. However, AdaPS acts on intent signals in
point-to-point communication rounds (see Section 5.3.4) that take a fairly
constant amount of time. AdaPS thus simplifies the general approach and
directly estimates the number of worker clocks per communication round.
This allows AdaPS to decide whether an intent signal should be included in
the current round or whether it suffices to include the signal in a later round.
The intent can be included in a later round if the next round will finish before
the worker reaches the start clock of the intent.
As acting (slightly) too early is much cheaper than acting too late, our
goal is to estimate a soft upper bound for the number of clocks during one
communication round. I.e., we want to be confident that the true number of
worker clocks only rarely (ideally, never) exceeds this soft upper bound. To
this end, we employ a probabilistic approach in AdaPS: we assume that the
number of clocks follows a Poisson distribution, estimate the (unknown) rate
parameter for the distribution from past communication rounds, and use
a high quantile of this Poisson distribution as a soft upper bound (e.g., the
0.9999 quantile). In detail, we assume that the number of clocks by worker $i$
in round $t$ follows $\mathrm{Poisson}(\lambda^i_t)$ with expected rate
$\lambda^i_t$. We choose a Poisson distribution because it is the simplest,
most natural assumption, and it worked well in our experiments. Note that we
assume a Poisson distribution for a short period of time (one round $t$ by one
worker $i$), not one global Poisson distribution (a much stronger, unrealistic
assumption, which, for example, would not account for changes in workload or
system load).
[Figure 5.7 (diagram): a clock axis $C$ for worker $i$. Starting at the current
worker clock $C^i_t$, the model expects $\hat\lambda^i_t$ clocks during this
communication round (round $t$) and $\hat\lambda^i_t$ clocks during the next
communication round (round $t+1$). The distribution
$\mathrm{Poisson}(2\hat\lambda^i_t)$, shifted by $C^i_t$, is drawn over the
axis with its 0.9999 quantile marked. In communication round $t$, AdaPS acts on
intents that start in this window.]
Figure 5.7: AdaPS learns automatically when to act on an intent signal. It
employs a probabilistic model to estimate a soft upper bound for the number
of clocks by a worker.
AdaPS acts on a given intent in round $t$ if it estimates that the
corresponding worker might reach the start clock of the intent
($I_{\text{start}}$) before round $t+1$ finishes, i.e., roughly⁴ if
$$I_{\text{start}} < C^i_t + Q_{\text{Poiss}}(2 \cdot \lambda^i_t,\, p)$$
where $C^i_t$ is the current clock of worker $i$ at the start of round $t$ and
$Q_{\text{Poiss}}(\lambda, p)$ computes the $p$ quantile of a Poisson
distribution with rate parameter $\lambda$. Figure 5.7 illustrates this
decision. Throughout our experiments, we used the quantile $p = 0.9999$. Under
our Poisson assumption, this gives a 99.99% probability that the actual number
of clocks by the worker during the two communication rounds is below our
estimate.
⁴ The exact decision is given in Algorithm 5.2, which follows in Section 5.3.2.
Estimating the Rate Parameter
Naturally, the true Poisson rate $\lambda^i_t$ is unknown. AdaPS estimates
this rate from the number of clocks in past rounds, using exponential smoothing:
$$\hat\lambda^i_t \leftarrow (1 - \alpha)\,\hat\lambda^i_{t-1} + \alpha\,(C^i_t - C^i_{t-1})$$
where $\hat\lambda^i_t$ is the estimate for the number of clocks by worker $i$
in round $t$ and $\alpha$ is the smoothing factor (we used $\alpha = 0.1$
throughout our experiments).
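As a purely illustrative calculation (the numbers are made up, not measurements), consider the smoothing factor $\alpha = 0.1$, a previous estimate $\hat\lambda^i_{t-1} = 10$, and an observed advance of $C^i_t - C^i_{t-1} = 20$ clocks:

$$\hat\lambda^i_t = (1 - 0.1) \cdot 10 + 0.1 \cdot 20 = 9 + 2 = 11$$

A single fast round thus moves the estimate only one tenth of the way toward the new observation.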
We consider two further aspects to improve the robustness of the estimate
$\hat\lambda^i_t$. First, in ML training tasks, there commonly are periods in
which the workers do not advance their clocks at all. For example, this is
commonly the case at the end of an epoch, while training is paused for model
evaluation. In such periods, the estimate for the number of clocks per round
would shrink. To keep the estimate more constant during such periods, AdaPS
does not update the estimate when the worker did not raise its clock during the
previous communication round (i.e., if $C^i_t - C^i_{t-1} = 0$).
Second, the observed number of clocks during round $t-1$ (i.e.,
$C^i_t - C^i_{t-1} = \Delta$) is not independent of the estimate
$\hat\lambda^i_{t-1}$: if the estimate was too low, AdaPS did not act on some
intents that the worker reached in this round, such that the worker needed to
access the corresponding parameters remotely, which typically slows down the
worker drastically. Thus, the estimate could settle in a “slow regime”. A large
enough Poisson quantile (i.e., $p \gg 0.5$) ensures that the estimate grows out
of such regimes over time. AdaPS further uses a simple heuristic to get out of
such regimes more quickly: if the number of clocks in the last round is larger
than the current estimate, it uses this number rather than the estimate (i.e.,
it uses $\max(\hat\lambda^i_t, \Delta)$).
Algorithm 5.2 depicts precisely how AdaPS decides whether to act on a
given intent and how it updates the estimate.
Effect on Usability
AdaPS’s automatic action timing relieves applications from the need to signal
intent “at the right time”, as applications need to do for triggering relocations
in Lapse and NuPS. It is important that applications signal intent early enough
so that there is enough time for AdaPS to act on the intent signal. Above
this lower limit, however, automatic action timing makes AdaPS insensitive
to when intent is signaled. Thus, applications can simply signal intent early,
without sacrificing performance, relying on AdaPS to figure out when it
should act on the signal.
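Algorithm 5.2: Automatic action timing in AdaPS. Should AdaPS
act on a given intent in communication round $t$?
Data: Intent start $I_{\text{start}}$, previous estimate $\hat\lambda^i_{t-1}$,
clock of worker $i$ at the start of round $t$ ($C^i_t$) and round $t-1$
($C^i_{t-1}$), smoothing factor $\alpha$, quantile $p$.
    $\Delta \leftarrow C^i_t - C^i_{t-1}$
    if $\Delta > 0$ then
        $\hat\lambda^i_t \leftarrow (1 - \alpha)\,\hat\lambda^i_{t-1} + \alpha\,\Delta$
    else
        $\hat\lambda^i_t \leftarrow \hat\lambda^i_{t-1}$
    return $I_{\text{start}} < C^i_t + Q_{\text{Poiss}}(2 \cdot \max(\hat\lambda^i_t, \Delta),\, p)$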
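The following Python sketch mirrors Algorithm 5.2 for readers who prefer code. It is a minimal illustration, not the AdaPS implementation (which is written in C++): the function names `poisson_quantile` and `should_act` are hypothetical, and computing the Poisson quantile by direct summation of the probability mass function is an assumption made to keep the example dependency-free.

```python
import math


def poisson_quantile(rate: float, p: float) -> int:
    """Smallest k with P(X <= k) >= p for X ~ Poisson(rate), for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if rate <= 0.0:
        return 0
    log_pmf = -rate              # log P(X = 0)
    cdf = math.exp(log_pmf)
    k = 0
    while cdf < p:
        k += 1
        # log P(X = k) = log P(X = k - 1) + log(rate) - log(k)
        log_pmf += math.log(rate) - math.log(k)
        cdf += math.exp(log_pmf)
    return k


def should_act(intent_start: int, clock_now: int, clock_prev: int,
               prev_estimate: float, alpha: float = 0.1,
               p: float = 0.9999) -> tuple[bool, float]:
    """Decide whether to act on an intent in the current round.

    Returns the decision and the updated estimate of clocks per round.
    """
    delta = clock_now - clock_prev
    if delta > 0:
        # Exponential smoothing over the observed clocks per round.
        estimate = (1 - alpha) * prev_estimate + alpha * delta
    else:
        # Worker did not advance (e.g., evaluation pause): keep the estimate.
        estimate = prev_estimate
    # Soft upper bound for the clocks during this round and the next one.
    bound = clock_now + poisson_quantile(2 * max(estimate, delta), p)
    return intent_start < bound, estimate
```

A caller would invoke `should_act` once per intent and communication round and carry the returned estimate over to the next round.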
5.3.3 Responsibility Follows Allocation
Based on intent signals, AdaPS decides when to relocate a parameter and
when to maintain a replica on which node. Two important design decisions
to enable this precise management are: (i) which node makes these decisions
and (ii) when replicas exist, how to keep them synchronized (efficiently)?
A key feature of AdaPS is that the node at which a parameter is currently
allocated (the owner node⁵ of the parameter) takes the main responsibility
for both. The owner node decides whether to relocate a parameter and where
to maintain replicas (Section 5.3.3); and the owner node acts as a hub for
replica synchronization (Section 5.3.3). Placing responsibility at the owner
node reduces network overhead and can reduce processing load because the
owner node changes whenever the parameter is relocated, such that
responsibility is close to where the parameter and the associated processing are.
Choice of Management Technique
AdaPS chooses a technique based on the intent signals of all nodes for a
parameter. Thus, the intent signals of all nodes for one parameter need to
come together at one node, such that this node can make this decision.
This decision could be made by a node that is statically assigned to a
parameter (e.g., by hash partitioning). Adapting the terminology that we
used in Section 3.2, we refer to such a statically assigned node as the home
node. In the static approach, nodes continuously send their intent signals for
parameter $k$ to the parameter’s home node. The home node decides whether
to relocate or replicate the parameter, and instructs the current owner of
parameter $k$ to act accordingly. The advantage of this static approach is that
it is straightforward to route intent signals, as the signals for one key are
sent to the same node throughout training. Its disadvantage is that the home
node is always involved, even when it does not have intent itself (a common
disadvantage of home-based approaches in distributed systems (Steen and
Tanenbaum 2017)). Figure 5.8a illustrates the communication for the static
approach. Even while there is intent only at node 2 and the parameter is
allocated at node 2, node 2 needs to communicate its intent to the home
node (node 0), such that the home node can decide to keep the parameter
allocated at node 2. While there is intent at nodes 2 and 3, both nodes need
to communicate their intent to node 0.
⁵ We adapt the terminology that we used to describe a similar concept in Lapse
in Section 3.2.
[Figure 5.8 (diagram): timelines for nodes 0 to 3 showing responsibility and
network communication. (a) Static responsibility. (b) Dynamic responsibility.]
Figure 5.8: Network communication for placing management responsibility
(a) on the statically assigned home node (node 0) or (b) on the dynamically
changing owner node.
To overcome this problem, AdaPS makes these decisions on the owner
node of the parameter, i.e., the node where the parameter is currently al-
located. This node changes whenever the parameter is relocated. The key
advantage of this approach is that the home node does not need to be involved,
reducing network traffic and processing load. Figure 5.8b illustrates the com-
munication for this dynamic approach. After an initial communication of
intent (and a subsequent relocation), no further communication between
node 2 and node 0 is necessary, as node 2 makes any decisions about the
parameter locally. While there is intent on node 2 and node 3, node 3 commu-
nicates its intent directly to node 2, without involving node 0. A disadvantage
of the dynamic approach is that routing becomes more complex, as the owner
node changes throughout training. To overcome this disadvantage, AdaPS
employs location caches, which enable nodes to send their intent signals to
the current owner node directly most of the time (see Section 5.3.4).
Replica Synchronization
While a parameter is replicated, multiple nodes hold a copy of the value of
the parameter and write updates to this local copy. For convergence, it is
crucial that these copies are synchronized, i.e., that updates of one node are
propagated to the replicas on other nodes. AdaPS employs relocation and
selective replication (as discussed in Section 5.3.1). Consequently, for the
majority of parameters, at a given time, only a few nodes (if any) hold a
replica. Thus, replica updates have to be propagated only to a
small subset of all nodes. Further, this subset is different for each parameter
and changes constantly and potentially rapidly. These two properties make
all-reduce or gossip-based synchronization approaches unattractive for AdaPS.
Instead, AdaPS propagates replica updates via the owner node of a parameter:
replica holders send updates to the parameter’s owner, which then propagates
them to other replica holders.
To use the network efficiently, AdaPS batches replica updates (as, e.g.,
Petuum (Xing et al. 2015) does). That is, AdaPS does not immediately send
replica updates when they are received. Instead, AdaPS slightly delays in-
dividual updates so that it can potentially send multiple updates together.
Like Petuum, AdaPS does so for both directions of update propagation: from
replica holders to the owner node and from the owner node to the replica
holders. To further improve efficiency, AdaPS versions parameter values and
communicates deltas: when nodes request updates for their replicas, they
include the version number that is locally available; the owner node sends
only updates that the requesting node has not received before.
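The owner-as-hub synchronization with versioned deltas can be sketched as follows. This is a simplified illustration under stated assumptions, not AdaPS's actual data structures: the classes `Owner` and `Replica`, the scalar parameter value, and the unbounded in-memory update log are all hypothetical choices made for brevity.

```python
from __future__ import annotations


class Owner:
    """Owner-side state for one parameter (scalar value for simplicity)."""

    def __init__(self, value: float = 0.0) -> None:
        self.value = value
        self.version = 0
        self.log: list[tuple[int, int, float]] = []  # (version, source node, delta)

    def merge(self, source: int, delta: float) -> None:
        # Merge a batched update from a replica holder and bump the version.
        self.version += 1
        self.value += delta
        self.log.append((self.version, source, delta))

    def updates_for(self, requester: int, known_version: int) -> tuple[int, float]:
        # Send only what the requester has not seen yet, excluding the
        # updates that the requester contributed itself.
        missing = sum(d for v, src, d in self.log
                      if v > known_version and src != requester)
        return self.version, missing


class Replica:
    """Replica-side state: local value, known version, and batched updates."""

    def __init__(self, node: int, value: float = 0.0) -> None:
        self.node = node
        self.value = value
        self.version = 0
        self.pending = 0.0

    def write(self, delta: float) -> None:
        # Apply locally and batch the update instead of sending it at once.
        self.value += delta
        self.pending += delta

    def sync(self, owner: Owner) -> None:
        if self.pending != 0.0:
            owner.merge(self.node, self.pending)
            self.pending = 0.0
        self.version, missing = owner.updates_for(self.node, self.version)
        self.value += missing
```

After each replica has synchronized once more than the last write, all copies in this sketch agree with the owner's value; a real system would additionally truncate the log once all replica holders have caught up.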
5.3.4 Efficient Communication
AdaPS communicates to exchange intent signals, to relocate parameters, to
set up and destruct replicas, and to synchronize replicas. Key for the overall
efficiency of AdaPS is that this communication is efficient. In this section,
we discuss several design aspects of AdaPS that improve communication
efficiency: AdaPS locally aggregates intent signals and sends aggregated intent
signals over the network (Section 5.3.4), groups messages (Section 5.3.4), and
employs location caches for more efficient routing (Section 5.3.4).
Aggregated Intent
As discussed in Section 5.3.3, for each parameter, there is one (dynamically
changing) node that decides whether to relocate or replicate a specific param-
eter. To enable this decision, all other nodes need to continuously send their
intent signals for this specific parameter to this decision-making node.
A naive approach to intent communication is that all workers eagerly
send each intent (i.e., a tuple of parameter, start clock, end clock, intent type,⁶
and worker id⁷) to the node that makes decisions immediately after the intent
is signaled. Additionally, workers regularly send the state of their clocks
(potentially by piggybacking on other messages). The decision-making node
stores all intents, determines which of them are active, and decides based on
this perfect view of all intents for this specific parameter. However, this naive
approach induces significant network overhead, as a large number of intent
signals have to be sent over the network constantly. Especially for hot spot
parameters, for which there are many intent signals, this can be prohibitive.
⁶ AdaPS does not need to communicate the intent type, as its decisions do not
depend on it, see Section 5.3.1.
⁷ So that the system knows which worker’s clock the start and end clocks refer
to, and on which node the intent was signaled.
AdaPS employs a more communication-efficient approach: each node
stores inactive intents locally, determines which ones should be treated as
active, and sends aggregated information about active intents to the decision-
making node. More precisely, each node communicates to the decision-
making node when intent becomes active and it communicates when intent
expires. The node does not communicate which or how many workers
have active intent. This requires significantly less network communication,
especially for hot spot parameters. The disadvantage is that the decision-
making node has less information. Precisely, it does not know about inactive
intents or how long active intent will last. For the decisions that AdaPS
makes, this information is not required, such that AdaPS adopts the more
communication-efficient approach.
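The aggregation can be illustrated with a small node-local counter that emits a message only on the transitions "first local worker becomes active" and "last local worker expires". This is a sketch under assumptions, not the AdaPS implementation: the class `NodeIntentAggregator`, its method names, and the `("ACTIVE", key)` / `("EXPIRED", key)` message format are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


class NodeIntentAggregator:
    """Aggregates the active intents of all local workers per parameter."""

    def __init__(self) -> None:
        # key -> number of local workers with active intent (never sent out)
        self._active: Counter[int] = Counter()

    def intent_became_active(self, key: int) -> Optional[tuple[str, int]]:
        self._active[key] += 1
        # Only the 0 -> 1 transition is communicated to the owner node.
        return ("ACTIVE", key) if self._active[key] == 1 else None

    def intent_expired(self, key: int) -> Optional[tuple[str, int]]:
        if self._active[key] == 0:
            raise ValueError("no active intent for this key")
        self._active[key] -= 1
        if self._active[key] == 0:
            del self._active[key]
            # Only the 1 -> 0 transition is communicated to the owner node.
            return ("EXPIRED", key)
        return None
```

With two local workers that have overlapping active intent for the same parameter, the sketch sends two messages in total rather than one message per worker intent, which is where the savings for hot spot parameters come from.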
Message Grouping
Automatic action timing ensures that a parameter is relocated or a replica
is set up asynchronously, i.e., before a worker accesses the parameter. This
allows AdaPS to improve network efficiency by grouping messages for com-
municating (aggregated) intent signals, for relocating parameters, and for
creating, destructing, and synchronizing replicas into one request–response
message protocol.
In more detail, to send a synchronization request, a dedicated thread (the
sync. thread in Figure 5.3) at a node collects (i) a list of parameters for which
local workers have active intent and (ii) all updates to local replicas. The node
sends these to the owners of the corresponding parameters in a synchronization
request. Each owner (i) merges the replica updates into its parameter store
and (ii) responds to each intent signal (with parameter relocation or replica
setup). It does so in one (grouped) synchronization response. By default, AdaPS
triggers a synchronization request as soon as the last communication round
has finished. To reduce the network and CPU load for synchronization,
AdaPS allows for limiting the number of communication rounds per second.
Routing
In AdaPS, nodes send intent signals and replica updates to the current owner
node. This owner node can change dynamically during run time. To route
messages, AdaPS adapts the home node forwarding approach (with location
caches) of Lapse. We briefly recap this approach below. See Section 3.2.3 for
more details.
As a fallback, there is one home node for each parameter. This home node
is assigned statically to each parameter (by hash partitioning). The home
node knows which (other) node is currently the owner of the parameter. If
any node does not know where a parameter is currently allocated, it sends its
message to the home node, which then forwards the message to the current
owner. Whenever a parameter is relocated, the old owner node informs the
home node of this relocation. These location updates are piggybacked onto
synchronization messages.
To increase efficiency, AdaPS additionally employs location caches. I.e.,
each node locally stores the last known location for parameters that it accessed
in the past. This allows for sending updates and intent signals directly to the
current owner. AdaPS uses synchronization responses, outgoing parameter
relocations, and responses to remote parameter accesses to update location
caches. It does not explicitly invalidate location caches. Instead, it tolerates
that messages can be routed based on stale ownership information and relies
on the receiving nodes to forward the messages to the current owner (via
the home node, see Section 3.2.3 for a more detailed discussion). Location
caches are more important in AdaPS than in Lapse as there are scenarios
in AdaPS in which nodes repeatedly send messages to the owner node. In
particular for hot spot parameters, the owner node changes rarely (see, e.g.,
Figure 5.5c), such that nodes send their signals and updates to the same owner
node repeatedly.
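The interplay of home node forwarding and location caches can be sketched as a small routing simulation. It is illustrative only and makes several assumptions that are not taken from AdaPS: integer keys, a modulo operation as a stand-in for hash partitioning, a home node whose ownership information is always up to date, and the hypothetical names `Router` and `deliver`.

```python
from __future__ import annotations


class Router:
    """Per-node routing state: a location cache plus a static home-node rule."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.location_cache: dict[int, int] = {}  # key -> last known owner

    def home_node(self, key: int) -> int:
        # Stand-in for hash partitioning (assumption: integer keys).
        return key % self.num_nodes

    def first_hop(self, key: int) -> int:
        # Use the cached location if there is one, else fall back to the home node.
        return self.location_cache.get(key, self.home_node(key))


def deliver(key: int, sender: Router, owner_of: dict[int, int]) -> list[int]:
    """Return the nodes a message visits until it reaches the current owner.

    owner_of models the (assumed up-to-date) knowledge of the home nodes.
    """
    hops = [sender.first_hop(key)]
    if hops[-1] != owner_of[key]:
        home = sender.home_node(key)
        if hops[-1] != home:
            hops.append(home)           # stale cache entry: forward to the home node
        if hops[-1] != owner_of[key]:
            hops.append(owner_of[key])  # home node forwards to the current owner
    sender.location_cache[key] = hops[-1]  # e.g., learned from the response
    return hops
```

In this sketch, the first message for a key travels via the home node, later messages go to the owner directly, and a message sent right after a relocation takes the longer path via the stale location and the home node before the cache is refreshed.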
5.4 Experiments
We conducted an experimental study to investigate whether and to what
extent fully adaptive parameter management is beneficial for the performance
of distributed ML. The source code, datasets, and information on reproducing
our experiments are available online.⁸
In this study, we evaluated the performance of AdaPS and compared it to
the performance of state-of-the-art PSs (Section 5.4.2). Further, we evaluated
its scalability (Section 5.4.3), the efficiency of different management tech-
niques (Section 5.4.4), whether action timing is crucial for the performance
of AdaPS (Section 5.4.5), and which decisions AdaPS makes when applied to
real-world ML tasks (Section 5.4.6). Our major insights are: (i) AdaPS was
efficient without any tuning, (ii) AdaPS even outperformed state-of-the-art
PSs on multiple tasks, (iii) AdaPS was more scalable than state-of-the-art PSs,
and (iv) automatic action timing made AdaPS efficient for early intent signals.
We conclude that PSs can be efficient and easy to use.
⁸ https://github.com/alexrenz/AdaPS/tree/review
Table 5.2: ML tasks, models, and datasets.
| Task | Model | Keys | Values | Model size | Dataset | Data points | Dataset size |
|---|---|---|---|---|---|---|---|
| Knowledge graph embeddings | ComplEx, dim. 500 | 4.8M | 4.8B | 35.9GB | Wikidata5M | 21M | 317MB |
| Word embeddings | Word2Vec, dim. 1000 | 1.9M | 1.9B | 14.0GB | 1b word benchm. | 375M | 3GB |
| Matrix factorization | Latent factors, rank 1000 | 11.0M | 11B | 163.9GB | 10m×1m zipf 1.1 | 1000M | 31GB |
5.4.1 Experimental Setup
Tasks
We used the same three knowledge graph embeddings (KGE), word embed-
dings (WE), and matrix factorization (MF) tasks as in our previous study in
Section 4.4 (see the task descriptions on pages 83 to 84). The only difference
is that we used AdaGrad (Duchi et al. 2011) consistently across all three
tasks (rather than plain SGD for WE and MF). The tasks differ in multiple
ways, including the size of the models, the size of the datasets, the rate at
which workers advance their clocks, and their access patterns. Table 5.2 provides
a summary.
Baselines
We compared to a classic PS, to NuPS, and to a single node implementa-
tion. As classic PS, we used AdaPS without intent signals, which provided
performance similar to PS-Lite (Mu Li et al. 2014a). We used a shared mem-
ory implementation with 32 worker threads as the single node baseline. To
achieve good performance, NuPS required tuning for its main performance
hyperparameters: (i) choosing a management technique for each parameter
and (ii) specifying a relocation offset (i.e., how many steps ahead of time to
relocate a parameter).
To ensure a fair comparison, we ran six different configurations of NuPS.
Five of these six configurations are designed to simulate a typical hyperpa-
rameter search by an application developer: they represent a random search
that is loosely informed by the NuPS heuristic and intuition. In detail, we
generated five configurations quasi-randomly using the Sobol sequence imple-
mentation of Ax (Bakshy et al. 2018). For choosing techniques, we narrowed
the search range using the NuPS heuristic presented in Section 4.4.1 (based on
pre-computed dataset frequency statistics) for each task. We generated a set
of configurations that replicate 0.01x–100x as many parameters (because in
the experiments in Section 4.4.6, deviations of up to 64x from the heuristic were
beneficial). For the relocation offset, we set the search space to 1–1000 as we
found that offsets of up to 512 can be beneficial (see Section 5.4.5). We used
the same set of configurations for the three tasks. We provide more details
on the search and the exact configurations online.
In addition to these five quasi-random configurations, we ran NuPS with
the hyperparameters of the tuned variant of Section 4.4. Note that these
hyperparameter choices are informed by a series of detailed experiments (see
Sections 4.4.5 and 4.4.6). Such detailed insights are not commonly available
to application developers. However, note that this tuning has been done for
a setting with a lower level of parallelism: in Section 4.4, we used 4x fewer
worker threads (8 per node) than in this section (32 per node, see below).
Implementation and Cluster
We implemented AdaPS in C++, using ZeroMQ and Protocol Buffers for
communication, based on PS-Lite (Mu Li et al. 2014a), Lapse, and NuPS.
We used the same cluster as in Section 4.4, i.e., a cluster of up to 16 Lenovo
ThinkSystem SR630 computers, running Ubuntu Linux 20.04, connected
with 100 Gbit Infiniband. Each node was equipped with two Intel Xeon Silver
4216 16-core CPUs, 512 GB of main memory, and one 2 TB D3-S4610 Intel
SSD. We compiled code with g++ 9.3.0. Unless specified otherwise, we used
8 nodes and 32 worker threads per node. In NuPS and AdaPS, we additionally
used 3 ZeroMQ I/O threads per node. In AdaPS, we used 4 communication
channels per node; in NuPS, we used 1 communication channel, as NuPS
does not support multiple communication channels.
Task Hyperparameters
We tuned the task hyperparameters for each task on a single node and used
the best found hyperparameter setting in all systems and variants. For the
tasks that use negative sampling (KGE and WE), we used local sampling in
NuPS and AdaPS. In the classic PS, we used independent sampling, as local
sampling provides poor sampling quality in a classic PS (see Section 4.4.5).
In AdaPS, we used arbitrarily large values for the intent signal offset (1000
data points in KGE, 2000 sentences in WE, and 10000 data points in MF).
If chosen large enough, the offset did not affect AdaPS’s performance (see
Section 5.4.5 for details).
Measures
We ran all variants with a fixed 4 h time budget. We measured model quality
over time and over epochs within this time budget. We conducted 3 independent
runs of each experiment, each starting from a distinct randomly
initialized model, and report the mean.⁹ We depict error bars for model
quality and run time; they present the minimum and maximum measure-
ments. In some experiments, error bars are not clearly visible because of small
variance. Gray shading indicates performance that is dominated by the single
node baseline. We report two types of speedups: (i) raw speedup depicts the
speedup in epoch run time, without considering model quality; (ii) effective
speedup depicts the improvement in quality over time. The effective speedup
is calculated from the time that each variant took to reach 90% of the best
model quality that we observed in the single node baseline.
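The two speedup measures can be written down as short helper functions. This is a sketch of the definitions above, not the evaluation scripts used for the experiments: the function names, the representation of a run as a list of `(time, quality)` pairs, and the assumption that higher quality values are better (as for MRR and accuracy, but not for RMSE) are choices made for this example, and the numbers in the usage example are invented.

```python
from __future__ import annotations

from typing import Optional


def raw_speedup(single_node_epoch_time: float, variant_epoch_time: float) -> float:
    """Speedup in epoch run time, ignoring model quality."""
    return single_node_epoch_time / variant_epoch_time


def time_to_quality(curve: list[tuple[float, float]],
                    threshold: float) -> Optional[float]:
    """First measured time at which the quality reaches the threshold."""
    for time, quality in curve:
        if quality >= threshold:
            return time
    return None  # threshold not reached within the time budget


def effective_speedup(single_node_curve: list[tuple[float, float]],
                      variant_curve: list[tuple[float, float]],
                      fraction: float = 0.9) -> Optional[float]:
    """Speedup in time to reach a fraction of the best single-node quality."""
    best = max(quality for _, quality in single_node_curve)
    threshold = fraction * best
    t_single = time_to_quality(single_node_curve, threshold)
    t_variant = time_to_quality(variant_curve, threshold)
    if t_single is None or t_variant is None:
        return None
    return t_single / t_variant


# Invented example curves: (run time in minutes, model quality).
single = [(60.0, 0.10), (120.0, 0.19), (180.0, 0.20)]
variant = [(10.0, 0.12), (20.0, 0.19), (30.0, 0.21)]
print(effective_speedup(single, variant))  # 6.0
```

For a quality measure where lower is better, the comparison and the choice of the best value would need to be inverted.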
Differences to Chapter 4
The main difference between this experimental study and the one in Sec-
tion 4.4 is that we used a higher degree of parallelism: we used 32 worker
threads per node rather than 8. This 4x increase makes localization conflicts
more likely.
5.4.2 Overall Performance
We compared the overall performance of AdaPS to existing PSs (classic and
NuPS) and to the single node baseline. We ran each variant for the fixed
time budget and measured model quality. Figure 5.9 depicts the results. In
summary, AdaPS matched or even outperformed the state-of-the-art PS
NuPS out of the box.
NuPS achieved good performance for KGE and MF (but not WE), but
required tuning to do so. Figure 5.9 shows three of the six NuPS configura-
tions that we ran: (i) the best and worst performing NuPS configurations
per task from our quasi-random hyperparameter search and (ii) the expertly
tuned hyperparameter configuration from Section 4.4. These results make it
clear that NuPS requires task-specific tuning: different configurations were
efficient for different tasks. For example, quasi-random configuration 4 was
the best one for the MF task, but the worst one for the WE task.
In contrast, AdaPS achieved good performance for all three tasks out
of the box, i.e., without requiring any tuning. AdaPS either matched (MF)
or outperformed (slightly in KGE¹⁰ and drastically in WE) the performance
of even the tuned configuration of NuPS. AdaPS was able to outperform
NuPS because (i) AdaPS did not suffer from relocation conflicts (i.e., remote
parameter access caused by multiple nodes concurrently accessing a parameter
that is managed by relocation) and (ii) AdaPS maintained replicas only while
they were needed, allowing it to synchronize the fewer maintained replicas
more frequently than NuPS. Section 5.4.6 provides detailed insights into how
AdaPS manages parameters and how that differs from NuPS.
⁹ For NuPS, we ran each of the 5 configurations once.
¹⁰ With an epoch 20% faster in AdaPS than in the tuned configuration of NuPS.
[Figure 5.9 (plots): model quality for Single node (1x32), Classic, NuPS (best
found config for the task), NuPS (worst found config for the task), NuPS (expert
tuning from Sec. 4.4), and AdaPS. (a) KGE (quality over time) and (b) KGE
(quality over epoch): MRR (filtered); annotated 6.9x speedup (7x effective);
labeled NuPS configs 5 and 2. (c) WE (quality over time) and (d) WE (quality
over epoch): accuracy; annotated 5.7x speedup (6.5x effective); labeled NuPS
configs 1 and 4. (e) MF (quality over time) and (f) MF (quality over epoch):
RMSE on test data; annotated 6.8x speedup (6.6x effective); labeled NuPS
configs 4 and 2. Time axes show run time in minutes.]
Figure 5.9: Performance of AdaPS and existing PSs on 8 nodes (32 threads per
node). AdaPS matched or even outperformed (tuned) NuPS out of the box
and provided good speedups over the single node baseline.
[Figure 5.10 (plots): speedup over number of nodes (1 to 16) for Linear scaling,
NuPS (best found config for the task), NuPS (expert tuning from Sec. 4.4), and
AdaPS. (a) KGE: raw. (b) WE: raw. (c) MF: raw. (d) KGE: effective. (e) WE:
effective. (f) MF: effective.]
Figure 5.10: Strong scaling (logarithmic axes). Raw speedup (a-c), i.e., with
respect to epoch run time, and effective speedup (d-f), i.e., with respect to
reaching 90% of the best model quality observed on a single node. Runs that did
not reach this threshold within the time limit are not shown.
The classic PS was inefficient because it requires synchronous network
communication for the majority of parameter accesses.
5.4.3 Scalability
We investigated the scalability of AdaPS and compared it to the best found
and the expertly tuned NuPS configurations. We ran on 2–16 nodes and
measured raw and effective speedups over the shared-memory single node
baseline. Figure 5.10 depicts the results. AdaPS scaled more efficiently
than NuPS, achieving near-linear raw and good effective speedups.
AdaPS scaled more efficiently than NuPS because increasing the number
of nodes increased the number of relocation conflicts in NuPS. This was
reflected in the share of remote accesses. In KGE, the share of remote accesses
in the best quasi-random NuPS configuration was 1.2%, 2.4%, 3.4%, and
5.3% on 2, 4, 8, and 16 nodes, respectively; in WE, it was 2.4%, 5.4%, 9.9%,
and 15.8%, respectively. NuPS scaled much better for MF because the task’s
locality resulted in a small number of relocation conflicts: 0.3%, 0.3%, 0.4%,
and 1.6%, respectively. In contrast, there were almost no remote parameter ac-
cesses (e.g., <0.0001% for KGE) in AdaPS because AdaPS dynamically created
replicas when multiple nodes concurrently accessed the same parameter.
The measured effective speedups slightly dropped on 16 nodes because
we tuned task hyperparameters (learning rate, regularization, etc.) for the
single node and—to minimize the impact of hyperparameter tuning in our
experiments—used the best single-node hyperparameter settings throughout
all experiments. However, these hyperparameter settings were not optimal
for the 16-node setting (as 16x more worker threads are used). With other
hyperparameters, we observed better effective scalability.
Communication Overhead
In addition, to gain further insights into scalability, we measured the number
of messages and the amount of data that AdaPS exchanges in these experi-
ments. Figure 5.11 depicts the measurements. It depicts (i) the total number
of messages sent among all nodes during one epoch and (ii) the total amount
of data exchanged among all nodes during one epoch. Note that an epoch
on a smaller cluster took longer than on a larger cluster: roughly, an epoch
on 2 nodes took 8x longer than an epoch on 16 nodes, see Figure 5.10. Also
note that these measurements should be viewed with a degree of caution, as
AdaPS communicates as frequently as possible. For example, a larger number
of replicas can reduce the synchronization frequency, so that a small number
of frequently synchronized replicas can use the same network bandwidth as
a larger number of replicas that are synchronized less frequently.
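The trade-off between the number of replicas and the synchronization frequency can be sketched as follows. This is a simplified model under the assumption that synchronization always uses the full available bandwidth and that every replica exchanges the same amount of data per synchronization round; the bandwidth and delta size are hypothetical values.

```python
def sync_rounds_per_second(bandwidth_bytes_per_s: float,
                           num_replicas: int,
                           bytes_per_replica_sync: float) -> float:
    """Synchronization rounds per second if the full bandwidth is used."""
    return bandwidth_bytes_per_s / (num_replicas * bytes_per_replica_sync)


def bytes_per_second(num_replicas: int,
                     bytes_per_replica_sync: float,
                     rounds_per_s: float) -> float:
    """Network traffic caused by synchronizing the replicas."""
    return num_replicas * bytes_per_replica_sync * rounds_per_s


BANDWIDTH = 1e9   # hypothetical: 1 GB/s
DELTA = 4_000.0   # hypothetical: bytes per replica per synchronization round

# Few replicas are synchronized often, many replicas rarely ...
few = sync_rounds_per_second(BANDWIDTH, 1_000, DELTA)
many = sync_rounds_per_second(BANDWIDTH, 100_000, DELTA)

# ... but both settings use the same network bandwidth.
traffic_few = bytes_per_second(1_000, DELTA, few)
traffic_many = bytes_per_second(100_000, DELTA, many)
```

In this model, the measured traffic alone does not reveal whether a run maintained few or many replicas, which is why the measurements should be read with caution.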
A few observations about these measurements stand out. First, both
the number of messages and the amount of exchanged data increased roughly
linearly (MF) or sub-linearly (KGE) or remained roughly constant (WE) with
increasing cluster size, but never increased super-linearly. Second, there was
no consistent pattern across the tasks: the relationship between communica-
tion and cluster size was different for the different tasks. Third, the number
of messages and the amount of exchanged data seemed to be connected: they
increased and decreased together.
5.4. Experiments
[Figure 5.11 (plots omitted). Series: AdaPS. Panels: (a) KGE: # of messages; (b) WE: # of messages; (c) MF: # of messages; (d) KGE: data exchanged; (e) WE: data exchanged; (f) MF: data exchanged. x-axis: number of nodes (1, 2, 4, 8, 16); y-axis: number of messages (a-c) or data exchanged in MB (d-f), logarithmic scale.]
Figure 5.11: Network communication during one epoch: total amount of data
exchanged and total number of messages sent among all nodes.
We leave a detailed analysis of the communication overhead scaling for
future work and merely speculate on a few key factors here. We speculate that
a key cause for the differences among the tasks is how AdaPS manages each
task. In Section 5.4.6, we will see that AdaPS heavily employs replication for
WE, less, but still a significant amount of replication for KGE, and almost
no replication for MF. We expect the following effects on communication
overhead. First, replication combined with linear scaling and constant syn-
chronization frequency leads to communication constant in the number of
nodes. In more detail: imagine a hotspot parameter for which there is a
replica on all nodes at all times. Whether AdaPS synchronizes 2 replicas
for $t$ minutes or $2 \cdot 4$ replicas for $t/4$ minutes results in roughly the same
communication overhead (assuming a constant time-based synchronization
frequency). Second, for a parameter that is managed by relocation, com-
munication increases with increasing number of nodes, because the chance
that the parameter is already allocated at the accessing node decreases with
increasing cluster size. For example, in a cluster of 2 nodes, when one of the
two nodes signals intent for a parameter, there is a 50% probability that this
parameter is already allocated at this node (so that no network communica-
tion is required). This probability decreases to 25% with 4 nodes, to 12.5%
with 8 nodes, and to 6.25% with 16 nodes. Note that the differences between
these steps get smaller the larger the cluster. Third, communication overhead
increases when replication is used instead of relocation (see also the following
Section 5.4.4). And the probability that AdaPS employs replication increases
with an increasing number of nodes, because the number of workers that
read and write parameters in parallel increases. Fourth, local sampling (used
in KGE and WE) contributes to a communication overhead that is constant
in the number of nodes because it draws all samples from the local parame-
ter partition, such that it causes no network communication, regardless of
cluster size.
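The first two effects can be illustrated with a few lines of arithmetic. This is a sketch under the assumptions stated in the text (parameters are allocated uniformly at random, scaling is linear, and the synchronization frequency is constant over time); the function names are ours and purely illustrative.

```python
def p_already_local(num_nodes: int) -> float:
    """Probability that a parameter is already allocated at the
    accessing node, assuming uniform allocation across the nodes."""
    return 1.0 / num_nodes


def replica_sync_volume(num_replicas: int, epoch_minutes: float,
                        syncs_per_minute: float = 1.0) -> float:
    """Replica synchronizations per epoch for one hot spot parameter
    that is replicated on all nodes at all times."""
    return num_replicas * epoch_minutes * syncs_per_minute


# Relocation: the chance of a local hit halves with each doubling.
for n in (2, 4, 8, 16):
    print(f"{n:2d} nodes: {p_already_local(n):.2%}")

# Replication: 2 replicas for t minutes vs. 2*4 replicas for t/4 minutes.
t = 100.0
small_cluster = replica_sync_volume(2, t)
large_cluster = replica_sync_volume(2 * 4, t / 4)
```

Under these assumptions, `small_cluster` and `large_cluster` are equal, which matches the roughly constant communication observed for the replication-heavy WE task.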
5.4.4 Efficiency of Techniques
The performance of AdaPS is the result of several components. To investi-
gate the contribution of individual management techniques, we ran AdaPS
with different techniques. By default, AdaPS employs relocation and repli-
cation, and chooses automatically which to employ. We built one ablation
variant that restricts AdaPS to replication, i.e., that replicates a parameter
whenever there is active intent (replicate-on-intent), and one that restricts
AdaPS to relocation, i.e., that relocates parameters when there is active intent
(relocate-on-intent). Additionally, we ran AdaPS with static full replication
[Figure 5.12 (plots omitted). Series: Single node (1x32); Full replication; AdaPS, replicate on intent; AdaPS, relocate on intent; AdaPS. Panels: (a) KGE (quality over time); (b) KGE (quality over epoch); (c) WE (quality over time); (d) WE (quality over epoch); (e) MF (quality over time); (f) MF (quality over epoch). x-axis: run time in minutes or epoch; y-axis: MRR (filtered) for KGE, accuracy for WE, RMSE on test data for MF. Annotation for MF: Full replication: out of memory.]
Figure 5.12: Performance of AdaPS and ablation variants on 8 nodes.
(by signaling intent for all parameters on all nodes throughout training).
Figure 5.12 depicts the results.
In summary, relocate-on-intent and full-
replication were inefficient. Replicate-on-intent was efficient for some
tasks. Combining relocation and replication was most efficient.
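The difference between the three variants can be sketched as a decision function. The sketch is a strong simplification: the rule in the adaptive branch (relocate if exactly one node has active intent, replicate if several do) is our illustrative assumption based on the behavior described in this section, it ignores action timing, and the names are hypothetical rather than taken from the AdaPS implementation.

```python
from enum import Enum
from typing import Set


class Variant(Enum):
    ADAPTIVE = "adaptive"                # default: choose per situation
    REPLICATE_ON_INTENT = "replicate"    # ablation: replication only
    RELOCATE_ON_INTENT = "relocate"      # ablation: relocation only


def choose_technique(variant: Variant, nodes_with_intent: Set[int]) -> str:
    """Pick a management technique for one parameter, given the set of
    nodes that currently have active intent for it."""
    if not nodes_with_intent:
        return "none"  # no active intent: nothing to do
    if variant is Variant.REPLICATE_ON_INTENT:
        return "replicate"
    if variant is Variant.RELOCATE_ON_INTENT:
        return "relocate"
    # Simplified adaptive rule (assumption): a single interested node
    # gets the parameter relocated; concurrent interest leads to replicas.
    return "relocate" if len(nodes_with_intent) == 1 else "replicate"
```

For example, `choose_technique(Variant.ADAPTIVE, {3})` yields relocation, whereas `choose_technique(Variant.ADAPTIVE, {3, 5})` yields replication.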
Static Full Replication
Static full replication provided good (but worse than AdaPS) performance
for WE because the model in WE is smaller than the ones in KGE and MF.
It provided poor model quality in KGE because synchronizing replicas of
the entire model on all nodes caused synchronization frequency to drop
(as network bandwidth is fixed). It ran out of memory for MF because the
model is large (and memory is also required for storing synchronization
deltas, training data, and message buffers).
Relocate on Intent
Restricting AdaPS to relocation provided poor performance for all tasks be-
cause relocation is inefficient for hot spot parameters, as observed previously
in Section 4.4.2.
Replicate on Intent
In contrast, restricting AdaPS to replication provided good run times for
KGE and WE. For MF, it was 3.0x slower than AdaPS because the MF task
exhibits locality (due to row-partitioning, each row parameter is accessed by
only one node) and replication is inefficient for managing locality.
AdaPS
Even for tasks without explicit locality (KGE and WE), combining relocation
and replication (as AdaPS does) was more communication-efficient than
relying exclusively on replication: replicate-on-intent sent 40% more data
over the network for one epoch of KGE, and 46% more for one epoch of
WE. AdaPS was more communication-efficient because relocation is more
efficient than replication if two nodes access a parameter one after the other:
relocation can send a parameter directly from the first to the second node,
whereas replication synchronizes via the allocation node.
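A toy hop-count model illustrates this difference. It is a sketch under strong assumptions: it counts only the transfers that carry a parameter value from the first to the second accessing node, it ignores replica setup, batching, and message sizes, and it does not attempt to reproduce the measured 40% and 46% differences. The function names and node identifiers are hypothetical.

```python
def transfers_relocation(first: int, second: int) -> int:
    """The parameter moves directly from the first to the second node."""
    return 0 if first == second else 1


def transfers_replication(owner: int, first: int, second: int) -> int:
    """The update travels first -> owner -> second. A hop is free if the
    respective accessing node is the allocation node (owner) itself."""
    return int(first != owner) + int(second != owner)


# Node 1 accesses the parameter, then node 2; node 0 holds the allocation.
direct = transfers_relocation(first=1, second=2)
via_owner = transfers_replication(owner=0, first=1, second=2)
```

In this model, replication needs two transfers where relocation needs one, unless one of the two accessing nodes happens to be the allocation node.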
5.4.5 Effect of Action Timing
To investigate the effect of action timing, we compared AdaPS to an ablation
variant that acts immediately after the intent signal. We ran both variants on
[Figure 5.13 (plots omitted). Series: AdaPS (with automatic action timing); Immediate action. Panels: (a) KGE: run time; (b) WE: run time; (c) MF: run time; (d) KGE: quality; (e) WE: quality; (f) MF: quality. x-axis: signal offset in clocks (1, 16, 256, 4096, 16k); y-axis: run time in minutes (a-c) or model quality (d-f).]
Figure 5.13: The effect of automatic action timing on epoch run time (a-c) and
model quality after one epoch (d-f). Automatic action timing makes AdaPS
efficient for early signals.
workloads with varying signal offsets. Figure 5.13 depicts the results.
With
automatic action timing, AdaPS was efficient for any sufficiently large
signal offset.
Early Signals
With automatic action timing, AdaPS provided excellent performance for all
large signal offset values. In contrast, with immediate action, performance was
poor for large signal offsets: run time increased and model quality decreased
(or even collapsed). The reason for this was that the immediate action variant
maintained replicas for longer than necessary, and created replicas in situations
in which (the more efficient) relocation was possible. Specifically, quality
decreased because the number of replicas increased drastically, such that
synchronization frequency dropped.
Late Signals
Smaller signal offsets improved performance for the immediate action
variant, but did not further improve performance of AdaPS. For both variants,
epoch run time was poor when intent was signaled so late that the PS had
insufficient time for setting up replicas or relocating parameters (and the
workers had to access the parameters remotely). Thus, with immediate action,
there was a task-specific optimum value for the signal offset. In contrast, with
automatic action timing, large signal offsets provided good performance
across all tasks.
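Why early signals hurt the immediate action variant can be sketched by comparing how long a replica is maintained for a single intent. The sketch is illustrative only: `lead_time` is a hypothetical stand-in for the point at which a PS with automatic action timing would act, and it is not the mechanism that AdaPS actually uses to time its actions.

```python
def replica_hold_time(signal_offset: int, use_duration: int,
                      lead_time: int, immediate: bool) -> int:
    """Clocks for which a replica is maintained for one intent.

    signal_offset: clocks between the intent signal and the first access
    use_duration:  clocks during which the intent is active
    lead_time:     hypothetical head start that suffices to set up the replica
    """
    if immediate:
        # Act right when the signal arrives: the replica idles until use.
        setup_ahead = signal_offset
    else:
        # Act only shortly before the intent becomes active.
        setup_ahead = min(signal_offset, lead_time)
    return setup_ahead + use_duration


early_immediate = replica_hold_time(4096, 10, lead_time=16, immediate=True)
early_deferred = replica_hold_time(4096, 10, lead_time=16, immediate=False)
late_immediate = replica_hold_time(4, 10, lead_time=16, immediate=True)
late_deferred = replica_hold_time(4, 10, lead_time=16, immediate=False)
```

In this model, the two variants behave identically for late signals, whereas for early signals the immediate variant maintains the replica for much longer than it is needed.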
5.4.6 AdaPS in Action
AdaPS decides dynamically and automatically between relocation and repli-
cation. To explore how AdaPS actually manages parameters, we traced the
parameter management of AdaPS. Figure 5.14 depicts parameter manage-
ment for selected parameters during the first half of the first epoch of KGE
training on 8 nodes.
AdaPS managed extreme hot spots and extreme cold
spots the same way a multi-technique PS does, but used more efficient
approaches for parameters between the extremes.
The Extremes
NuPS decides statically how to manage a parameter. The NuPS heuristic
would use static full replication for the first three depicted parameters (Figures
5.14a, 5.14b, and 5.14c) and relocation for the others. AdaPS ended
up managing the two extremes—i.e., the extreme hot spot (Figure 5.14a) and
the rarely accessed parameter (Figure 5.14e)—as NuPS would.
Between the Extremes
For the parameters between these two extremes, AdaPS took more fine-grained
approaches than NuPS. For example, for the parameter in Figure 5.14b,
AdaPS maintained replicas exactly while they were needed. Concretely, the
gray areas in Figures 5.14b and 5.14c indicate periods in which AdaPS commu-
nicates less than NuPS: in contrast to AdaPS, NuPS would maintain replicas
during these periods. For the parameter in Figure 5.14d, AdaPS created (short-
lived) replicas whenever multiple nodes accessed this parameter concurrently.
These short-lived replicas are barely visible in the figure, so we highlighted
two of them with red boxes. The short-term replicas prevented workers from
having to access the parameter remotely. In contrast, NuPS would manage
[Figure 5.14 legend: One node; Allocation; Replica]
(a) Hot spot parameter (most frequent)
(b) Hot spot parameter (99.997% frequency quantile)
(c) Hot spot parameter (99.991% frequency quantile)
(d) Frequent parameter (99.95% frequency quantile)
(e) Typical parameter (median frequency)
Figure 5.14: Parameter management for selected parameters in the KGE task.
Each row corresponds to one of 8 nodes. The red boxes indicate two of the
many hardly visible short-lived replicas.
(a) Embedding for troubled (99.76% frequency quantile)
(b) Embedding for stays (99.25% frequency quantile)
(c) Embedding for remedy (98.8% frequency quantile)
(d) Embedding for sleeved (97.2% frequency quantile)
(e) Embedding for marra (94.5% frequency quantile)
(f) Embedding for lockwoods (55% frequency quantile)
Figure 5.15: Parameter management for selected parameters in the WE task.
(a) Column 488348 (most frequent)
(b) Column 938429 (99.9994% frequency quantile)
(c) Column 662126 (99.9% frequency quantile)
(d) Column 802389 (89.0% frequency quantile)
(e) Column 913931 (55% frequency quantile)
(f) Column 255321 (8% frequency quantile)
Figure 5.16: Parameter management for selected parameters in the MF task.
this parameter exclusively by relocation, such that workers are slowed down
by remote parameter accesses.
Differences Among ML Tasks
Figures 5.15 and 5.16 depict AdaPS’s parameter management for selected
parameters of the WE and MF tasks, respectively. Again, the figures depict pa-
rameter management during the first half of the first epoch on 8 nodes. In WE,
AdaPS employed more replication than in KGE, i.e., it employed replication
for a larger share of the parameters: AdaPS ended up fully replicating all WE
parameters above the 99.76% frequency quantile. In contrast, in KGE, AdaPS
ended up fully replicating only parameters above the 99.999% frequency
quantile. Also, AdaPS made heavy use of replication for parameters down to
the 98% frequency quantile, e.g., for the embedding of the word remedy, see
Figure 5.15c. The reason for this is that the access frequency distribution in
the WE task is less skewed (i.e., broader) than the one in the KGE task (see
Figure 4.2 on page 66). This matches our findings in Section 4.4.6, where we
found that it is beneficial for NuPS to replicate a larger share of parameters
in the WE task than in the KGE and MF tasks.
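The quantile thresholds above translate into shares of the model as follows. The sketch assumes that the frequency quantile of a parameter denotes the share of parameters that are accessed less frequently, so that the parameters above quantile q make up a share of 100% - q of all parameters; it uses decimal arithmetic to avoid floating-point rounding in the subtraction.

```python
from decimal import Decimal


def share_above_quantile(quantile_percent: str) -> Decimal:
    """Share of parameters (in percent) above a frequency quantile."""
    return Decimal("100") - Decimal(quantile_percent)


we_share = share_above_quantile("99.76")     # fully replicated in WE
kge_share = share_above_quantile("99.999")   # fully replicated in KGE

# Relative difference in the share of fully replicated parameters
ratio = we_share / kge_share
```

Under this assumption, AdaPS fully replicated 0.24% of the WE parameters, but only 0.001% of the KGE parameters, i.e., a 240 times larger share.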
In contrast, AdaPS almost never employed replication for MF. The
reason for this was the locality in the MF task: the training algorithm visited
the data points ordered by their column. This means that the algorithm
picks a column and then processes all the data points that belong to this
column, before picking another column. Only rarely did multiple nodes
access the same column at the same time. Thus, AdaPS employed relocation
most of the time and only rarely replication. For example, one of these rare
occurrences is that node 5 and node 8 concurrently accessed column 938429
(Figure 5.16b) for a brief period, which caused AdaPS to maintain a replica
on node 8 for a brief period. In general, over an entire epoch (of which only
half is visible), each parameter was relocated 8 times (once to each node; see footnote 11).
At each node, frequent parameters were accessed many times (e.g., around
1.1 million times at each node for the most frequent parameter, column 488348),
and infrequent parameters only a few times (e.g., around 31 times at each
node for column 913931 at the 55% frequency quantile).
Footnote 11: Of course, if one training data partition contained no data point of a specific column,
then the corresponding parameters were not transferred to this node. Thus, the corresponding
parameters were relocated fewer than 8 times in an epoch.
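The relocation count per column can be sketched directly from the data partitioning. The sketch assumes that a column parameter is transferred once to every node whose training data partition contains at least one data point of that column; the data layout and the example partitions are hypothetical.

```python
from typing import Dict, List, Tuple

# partitions[n] holds the (row, column) data points stored on node n
Partitions = List[List[Tuple[int, int]]]


def relocations_per_epoch(partitions: Partitions) -> Dict[int, int]:
    """For each column, count the nodes whose partition contains at least
    one data point of this column (one transfer per such node)."""
    counts: Dict[int, int] = {}
    for data_points in partitions:
        for column in {c for _, c in data_points}:
            counts[column] = counts.get(column, 0) + 1
    return counts


# Made-up example with 3 nodes: column 7 occurs on all nodes,
# column 9 only on the first and the last node.
example = [
    [(0, 7), (1, 7), (1, 9)],
    [(2, 7)],
    [(4, 9), (5, 7)],
]
counts = relocations_per_epoch(example)
```

On 8 nodes, a column that occurs in every partition yields a count of 8, and a column that is missing from some partitions yields a correspondingly smaller count.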
5.5 Summary
In this chapter, we explored whether PSs can adapt to the underlying ML task
automatically, without prior tuning. To this end, we proposed intent signal-
ing, a novel mechanism to enable automatic adaptation. And we described
AdaPS, a PS that automatically adapts to the underlying workload based solely
on information that this mechanism provides. In our experimental study,
AdaPS was efficient for multiple ML tasks out of the box, without requir-
ing any tuning, and matched or even outperformed state-of-the-art—more
complex to use—PSs.
Chapter 6
Conclusions
In this thesis, we studied the efficiency of PSs for ML tasks with sparse
parameter access. We observed that existing PSs are inefficient for such tasks:
they barely outperform efficient single node implementations. Starting from
this observation, we investigated whether and to what extent PSs can become
more efficient by adapting to the underlying ML task. Step by step, we made
the PS more adaptive. We explored adapting parameter allocation to exploit
access locality. We explored adapting management techniques to tailor the PS
to the access patterns of individual parameters. And we explored accounting
for the type of parameter access to efficiently support sampling access.
These adaptations can make the PS more efficient, but they also make
the PS more and more complex to use, as the application needs to control
adaptation manually. To reduce usage complexity, we presented an approach
that allows for automatic adaptation, without manual control by the application.
This approach consists of a mechanism to pass relevant information
from the application to the PS, and a PS that adapts automatically based on
this information. Our experiments indicate that our work enables efficient
distributed training for a range of ML tasks with sparse parameter access.
Such efficient distributed training allows researchers and practitioners (i)
to train larger models and (ii) to train models faster. Thus, our contributions
towards efficient PSs enable researchers and practitioners to develop better
ML-based solutions for challenging problems. And our work contributes
to squandering fewer of our planet’s precious resources by using them more
efficiently: in particular, everything that is necessary to produce computers,
network hardware, and data center infrastructure, and the energy necessary
to operate them.
Future Work
There are several interesting research directions based on the work in this
thesis. We list some of these in the following.
Integration into Common ML Systems
ML systems like PyTorch (Paszke et al. 2019) or TensorFlow (Abadi et al.
2016) provide convenient abstractions to define and run ML tasks. An interesting
direction for research is the integration of adaptive parameter management
into these common ML systems. Key research questions in this direction are
(i) whether and how the data loaders of these systems can automatically gener-
ate intent signals from model definitions and (ii) how fine-grained parameter
management can be realized behind coarse-grained tensor parameter defini-
tions. A further aspect in this direction is whether PSs can better integrate
with the optimizer abstractions of these ML systems.
Integration of Hardware Accelerators
The PSs presented in this thesis store model parameters in main memory.
However, especially for deeper models, it is common to use hardware accel-
erators such as GPUs and TPUs for the gradient computation. The local
memory available on these accelerators is more limited than main memory,
and transferring model parameters to accelerator memory can induce signifi-
cant latency. An interesting direction for future work is to investigate how
adaptive PSs can better integrate such accelerators. Research questions in this
direction include (i) how a PS could incorporate accelerator memory (Adnan
et al. 2021) and (ii) how a PS can leverage (heterogeneous) communication
links among the accelerators (e.g., NVLink across (some) of the GPUs on
one node).
Other ML Tasks
Throughout this thesis, we evaluated adaptive PSs mostly on shallow embed-
ding models. It would be interesting to apply adaptive PSs to deeper models,
such as deep learning recommender models (H.-T. Cheng et al. 2016; Guo
et al. 2017; R. Wang et al. 2017; Zhou et al. 2018), graph neural networks
(Schlichtkrull et al. 2018; Shang et al. 2019; Vashishth et al. 2020), and natural
language processing (Peters et al. 2018; Devlin et al. 2019). In such models,
parts of the model are accessed sparsely (typically the first and sometimes the
last layer), and other parts are accessed densely (e.g., fully connected hidden
layers). Another class of models that is potentially interesting are models
with conditional routing (Shazeer et al. 2017; Riquelme et al. 2021; Lepikhin
et al. 2021; Gururangan et al. 2022; Fedus et al. 2022; Margaret Li et al. 2022).
In such models, parts of the model (and, thus, the corresponding parameters)
are activated sparsely. A router component determines which part(s) of the
model are accessed for a specific training example. The key question for
these models is whether intent signaling can be used (e.g., by executing the
routing component during batch preparation) and whether that could make
distributed training more efficient.
Other Intent Signaling PSs
AdaPS is one of many conceivable PSs based on intent signaling. An inter-
esting direction for future work is to further explore the design space of
intent signaling PSs, either by developing fundamentally different systems or
by exploring potential improvements to individual components of AdaPS.
Within AdaPS, it would be interesting to explore parameter management that
makes different decisions for different combinations of intent types. Further,
one could explore what other synchronization and communication protocols
are possible and whether they can further improve communication efficiency.
Or one could develop and incorporate additional, potentially highly tailored,
management techniques or novel sampling schemes.
Other Directions for Improving Efficiency
In this thesis, we explored how a PS can be efficient for a given (i.e., fixed)
workload. Another direction is to try to improve efficiency by manipulating
the workload. One direction in this field is whether explicitly ordering update
steps could improve efficiency, e.g., by arranging update functions in a way
that increases access locality. Another interesting direction is to explore how
adaptive PSs can be combined with work on compression (see Section 2.3.1).
Bibliography
Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jef-
frey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael
Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore,
Derek Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete War-
den, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng (2016). “TensorFlow:
A System for Large-scale Machine Learning”. In: Proceedings of the 12th
Conference on Operating Systems Design and Implementation. OSDI ’16.
USENIX Association, pp. 265–283 (cit. on pp. 2, 9, 13, 18, 25, 108, 144).
Adnan, Muhammad, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and
Prashant J. Nair (2021). “Accelerating Recommendation System Training
by Leveraging Popular Choices”. In: PVLDB 15.1, pp. 127–140 (cit. on
pp. 23, 144).
Agarwal, Saurabh, Hongyi Wang, Shivaram Venkataraman, and Dimitris
Papailiopoulos (2022). “On the Utility of Gradient Compression in
Distributed Training Systems”. In: Proceedings of Machine Learning and
Systems. Vol. 4, pp. 652–672 (cit. on p. 22).
Ahmed, Amr, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy,
and Alexander Smola (2012). “Scalable Inference in Latent Variable
Models”. In: Proceedings of the 5th ACM International Conference on
Web Search and Data Mining. WSDM ’12. Association for Computing
Machinery, pp. 123–132 (cit. on pp. 2, 14, 15, 28, 68).
Aji, Alham Fikri and Kenneth Heafield (2017). “Sparse Communication
for Distributed Gradient Descent”. In: Proceedings of the 2017 Confer-
ence on Empirical Methods in Natural Language Processing. EMNLP ’17.
Association for Computational Linguistics, pp. 440–445 (cit. on p. 22).
Alistarh, Dan, Demjan Grubic, Jerry Z. Li, Ryota Tomioka, and Milan Vo-
jnovic (2017). “QSGD: Communication-Efficient SGD via Gradient
Quantization and Encoding”. In: Advances in Neural Information Pro-
cessing Systems. NeurIPS ’17. Curran Associates, pp. 1707–1718 (cit. on
p. 22).
Assran, Mahmoud, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat (2019).
“Stochastic Gradient Push for Distributed Deep Learning”. In: Proceedings
of the 36th International Conference on Machine Learning. Ed. by Kamalika
Chaudhuri and Ruslan Salakhutdinov. Vol. 97. ICML ’19. PMLR,
pp. 344–353 (cit. on p. 24).
Auer, Sören, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard
Cyganiak, and Zachary Ives (2007). “DBpedia: A nucleus for a web of
open data”. In: The Semantic Web. ISWC ’07. Springer, pp. 722–735
(cit. on p. 49).
Awan, Ammar Ahmad, Khaled Hamidouche, Jahanzeb Maqbool Hashmi,
and Dhabaleswar K. Panda (2017). “S-Caffe: Co-Designing MPI Runtimes
and Caffe for Scalable Deep Learning on Modern GPU Clusters”. In:
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming. PPoPP ’17. Association for Computing
Machinery, pp. 193–205 (cit. on p. 23).
Bakshy, Eytan, Lili Dworkin, Brian Karrer, Konstantin Kashin, Benjamin
Letham, Ashwin Murthy, and Shaun Singh (2018). “AE: A domain-
agnostic platform for adaptive experimentation”. In: Workshop on Systems
for ML and Open Source Software at NeurIPS 2018 (cit. on p. 125).
Balazevic, Ivana, Carl Allen, and Timothy Hospedales (2019). “TuckER: Ten-
sor Factorization for Knowledge Graph Completion”. In: Proceedings of
the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Process-
ing. EMNLP-IJCNLP ’19. Association for Computational Linguistics,
pp. 5185–5194 (cit. on pp. 2, 9).
Bamler, Robert and Stephan Mandt (2020). “Extreme Classification via Ad-
versarial Softmax Approximation”. In: Proceedings of the 8th International
Conference on Learning Representations. ICLR ’20 (cit. on pp. 64, 67).
Battiti, Roberto (1989). “Accelerated Backpropagation Learning: Two Op-
timization Methods”. In: Complex systems 3.4, pp. 331–342 (cit. on
p. 89).
Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent (2000). “A Neu-
ral Probabilistic Language Model”. In: Advances in Neural Information
Processing Systems. Vol. 13. MIT Press (cit. on pp. 1, 11).
Bernstein, Jeremy, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree
Anandkumar (2018). “signSGD: Compressed Optimisation for Non-
Convex Problems”. In: Proceedings of the 35th International Conference
on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80.
ICML ’18. PMLR, pp. 560–569 (cit. on p. 22).
Beutel, Alex, Partha Pratim Talukdar, Abhimanu Kumar, Christos Falout-
sos, Evangelos Papalexakis, and Eric Xing (2014). “FlexiFaCT: Scalable
Flexible Factorization of Coupled Tensors on Hadoop”. In: Proceedings
of the 2014 SIAM International Conference on Data Mining. SDM ’14,
pp. 109–117 (cit. on pp. 27, 30).
Boehm, Matthias, Iulian Antonov, Sebastian Baunsgaard, Mark Dokter,
Robert Ginthör, Kevin Innerebner, Florijan Klezin, Stefanie N. Lind-
staedt, Arnab Phani, Benjamin Rath, Berthold Reinwald, Shafaq Siddiqui,
and Sebastian Benjamin Wrede (2020). “SystemDS: A Declarative Ma-
chine Learning System for the End-to-End Data Science Lifecycle”. In:
10th Conference on Innovative Data Systems Research. CIDR ’20 (cit. on
p. 25).
Boehm, Matthias, Michael Dusenberry, Deron Eriksson, Alexandre Evfimiev-
ski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Fred-
erick Reiss, Prithviraj Sen, Arvind Surve, and Shirish Tatikonda (2016).
“SystemML: Declarative Machine Learning on Spark”. In: PVLDB 9.13,
pp. 1425–1436 (cit. on p. 25).
Bordes, Antoine, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston,
and Oksana Yakhnenko (2013). “Translating Embeddings for Modeling
Multi-relational Data”. In: Advances in Neural Information Processing
Systems. NeurIPS ’13. Curran Associates (cit. on pp. 2, 9, 49).
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal (2016). “Optimization
Methods for Large-Scale Machine Learning”. In: CoRR abs/1606.04838
(cit. on p. 8).
Broscheit, Samuel, Daniel Ruffinelli, Adrian Kochsiek, Patrick Betz, and
Rainer Gemulla (2020). “LibKGE - A Knowledge Graph Embedding
Library for Reproducible research”. In: Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing: System Demonstra-
tions. Association for Computational Linguistics, pp. 165–174 (cit. on
pp. 76, 77, 84).
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Ka-
plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma-
teusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher
Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei
(2020). “Language Models are Few-Shot Learners”. In: Advances in Neural
Information Processing Systems. Vol. 33. NeurIPS ’20. Curran Associates,
pp. 1877–1901 (cit. on p. 1).
Carbone, Paris, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif
Haridi, and Kostas Tzoumas (2015). “Apache Flink: Stream and Batch
Processing in a Single Engine”. In: IEEE Data Engineering Bulletin 38.4,
pp. 28–38 (cit. on p. 25).
Chechik, Gal, Varun Sharma, Uri Shalit, and Samy Bengio (2010). “Large
Scale Online Learning of Image Similarity Through Ranking”. In: Journal
of Machine Learning Research 11.36, pp. 1109–1135 (cit. on pp. 64, 67).
Chelba, Ciprian, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants,
and Phillipp Koehn (2013). “One Billion Word Benchmark for Measuring
Progress in Statistical Language Modeling”. In: CoRR abs/1312.3005 (cit.
on pp. 50, 84).
Chen, Jianmin, Rajat Monga, Samy Bengio, and Rafal Józefowicz (2016).
“Revisiting Distributed Synchronous SGD”. In: CoRR abs/1604.00981
(cit. on pp. 10, 11).
Chen, Po-Lung, Chen-Tse Tsai, Yao-Nan Chen, Ku-Chun Chou, Chun-Liang
Li, Cheng-Hao Tsai, Kuan-Wei Wu, Yu-Cheng Chou, Chung-Yi Li, Wei-
Shih Lin, Shu-Hao Yu, Rong-Bing Chiu, Chieh-Yen Lin, Chien-Chih
Wang, Po-Wei Wang, Wei-Lun Su, Chen-Hung Wu, Tsung-Ting Kuo, Todd
G. McKenzie, Ya-Hsuan Chang, Chun-Sung Ferng, Chia-Mau Ni, Hsuan-
Tien Lin, Chih-Jen Lin, and Shou-De Lin (2012). “A Linear Ensemble
of Individual and Blended Models for Music Rating Prediction”. In:
Proceedings of KDD Cup 2011. Vol. 18. Proceedings of Machine Learning
Research. PMLR, pp. 21–60 (cit. on pp. 1, 2, 9).
Chen, Rong, Jiaxin Shi, Yanzhe Chen, and Haibo Chen (2015). “PowerLyra:
Differentiated Graph Computation and Partitioning on Skewed Graphs”.
In: Proceedings of the 10th European Conference on Computer Systems.
EuroSys ’15. Association for Computing Machinery (cit. on p. 28).
Chen, Tianqi, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tian-
jun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang (2015). “MXNet:
A Flexible and Efficient Machine Learning Library for Heterogeneous
Distributed Systems”. In: CoRR abs/1512.01274 (cit. on pp. 2, 9, 13, 18,
25, 108).
Cheng, Heng-Tze, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chan-
dra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa
Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing
Liu, and Hemal Shah (2016). “Wide & Deep Learning for Recommender
Systems”. In: Proceedings of the 1st Workshop on Deep Learning for Rec-
ommender Systems. DLRS ’16. Association for Computing Machinery,
pp. 7–10 (cit. on pp. 2, 9, 144).
Cheng, Wenliang, Chengyu Wang, Bing Xiao, Weining Qian, and Aoying
Zhou (2016). “On Statistical Characteristics of Real-Life Knowledge
Graphs”. In: Big Data Benchmarks, Performance Optimization, and Emerg-
ing Hardware. Springer, pp. 37–49 (cit. on pp. 63, 65).
Chilimbi, Trishul, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanara-
man (2014). “Project Adam: Building an Efficient and Scalable Deep
Learning Training System”. In: Proceedings of the 11th Conference on
Operating Systems Design and Implementation. OSDI ’14. USENIX
Association, pp. 571–582 (cit. on pp. 2, 14).
Ciciani, Bruno, Daniel Dias, and Philip Yu (1990). “Analysis of Replication
in Distributed Database Systems”. In: IEEE Transactions on Knowledge &
Data Engineering 2.02, pp. 247–261 (cit. on p. 70).
Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman (2009). “Power-
Law Distributions in Empirical Data”. In: SIAM Review 51.4, pp. 661–703
(cit. on pp. 63, 65).
Cui, Henggang, Alexey Tumanov, Jinliang Wei, Lianghong Xu, Wei Dai,
Jesse Haber-Kucharsky, Qirong Ho, Gregory Ganger, Phillip Gibbons,
Garth Gibson, and Eric Xing (2014). “Exploiting Iterative-ness for Parallel
ML Computations”. In: Proceedings of the ACM Symposium on Cloud
Computing. SOCC ’14. Association for Computing Machinery (cit. on
pp. 18, 31, 69).
Dai, Wei, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, and
Eric P Xing (2015). “High-Performance Distributed ML at Scale through
Parameter Server Consistency Models”. In: Proceedings of the 29th AAAI
Conference on Artificial Intelligence. AAAI ’15. AAAI Press, pp. 79–87
(cit. on pp. 2, 16, 18–21, 25, 31, 43, 58, 64, 69, 70, 77, 101).
Das, Abhinandan S., Mayur Datar, Ashutosh Garg, and Shyam Rajaram
(2007). “Google News Personalization: Scalable Online Collaborative
Filtering”. In: Proceedings of the 16th International Conference on World
Wide Web. WWW ’07. Association for Computing Machinery, pp. 271–
280 (cit. on p. 1).
Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc
Le, Mark Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke
Yang, and Andrew Ng (2012). “Large Scale Distributed Deep Networks”.
In: Advances in Neural Information Processing Systems. NeurIPS ’12.
Curran Associates, pp. 1223–1231 (cit. on pp. 11, 25).
Demers, Alan, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott
Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry (1987). “Epi-
demic Algorithms for Replicated Database Maintenance”. In: Proceedings
of the 6th Annual ACM Symposium on Principles of Distributed Computing.
PODC ’87. Association for Computing Machinery, pp. 1–12 (cit. on
pp. 24, 46).
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019).
“BERT: Pre-training of Deep Bidirectional Transformers for Language Un-
derstanding”. In: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies. NAACL ’19. Association for Computational Linguistics,
pp. 4171–4186 (cit. on pp. 1, 9, 10, 66, 144).
Dowdy, Lawrence W. and Derrell V. Foster (1982). “Comparative Models
of the File Assignment Problem”. In: ACM Computing Surveys 14.2,
pp. 287–313 (cit. on p. 70).
Duchi, John, Elad Hazan, and Yoram Singer (2011). “Adaptive Subgradient
Methods for Online Learning and Stochastic Optimization”. In: Journal
of Machine Learning Research 12, pp. 2121–2159 (cit. on pp. 49, 83, 125).
El Abbadi, Amr (1991). “Adaptive protocols for managing replicated dis-
tributed databases”. In: Proceedings of the 3rd IEEE Symposium on Parallel
and Distributed Processing, pp. 36–43 (cit. on p. 70).
Faloutsos, Michalis, Petros Faloutsos, and Christos Faloutsos (1999). “On
Power-Law Relationships of the Internet Topology”. In: Proceedings of the
Conference on Applications, Technologies, Architectures, and Protocols for
Computer Communication. SIGCOMM ’99. Association for Computing
Machinery, pp. 251–262 (cit. on pp. 63, 65).
Fan, Shiqing, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng,
Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xi-
aoyong Liu, and Wei Lin (2021). “DAPPLE: A Pipelined Data Parallel
Approach for Training Large Models”. In: Proceedings of the 26th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming.
PPoPP ’21. Association for Computing Machinery, pp. 431–445 (cit. on
p. 23).
Fedus, William, Barret Zoph, and Noam Shazeer (2022). “Switch Trans-
formers: Scaling to Trillion Parameter Models with Simple and Efficient
Sparsity”. In: Journal of Machine Learning Research 23.120, pp. 1–39 (cit.
on p. 145).
Gemulla, Rainer, Erik Nijkamp, Peter Haas, and Yannis Sismanis (2011).
“Large-scale Matrix Factorization with Distributed Stochastic Gradient
Descent”. In: Proceedings of the 17th ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining. KDD ’11. Association
for Computing Machinery, pp. 69–77 (cit. on pp. 27, 29, 30, 46, 48, 87).
Ghoting, Amol, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Rein-
wald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivaku-
mar Vaithyanathan (2011). “SystemML: Declarative Machine Learning
on MapReduce”. In: Proceedings of the 2011 IEEE 27th International
Conference on Data Engineering. ICDE ’11. IEEE Computer Society,
pp. 231–242 (cit. on p. 25).
Gonzalez, Joseph, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos
Guestrin (2012). “PowerGraph: Distributed Graph-parallel Computation
on Natural Graphs”. In: Proceedings of the 10th Conference on Operating
Systems Design and Implementation. OSDI ’12. USENIX Association,
pp. 17–30 (cit. on pp. 27, 28, 63, 65).
Goyal, Priya, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Weso-
lowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He
(2017). “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”.
In: CoRR abs/1706.02677 (cit. on p. 22).
Grover, Aditya and Jure Leskovec (2016). “Node2vec: Scalable Feature Learn-
ing for Networks”. In: Proceedings of the 22nd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining. KDD ’16.
Association for Computing Machinery, pp. 855–864 (cit. on pp. 64, 65,
67).
Guo, Huifeng, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He
(2017). “DeepFM: A Factorization-Machine Based Neural Network for
CTR Prediction”. In: Proceedings of the 26th International Joint Conference
on Artificial Intelligence. IJCAI ’17. AAAI Press, pp. 1725–1731 (cit. on
pp. 2, 9, 144).
Gururangan, Suchin, Mike Lewis, Ari Holtzman, Noah A. Smith, and Luke
Zettlemoyer (2022). “DEMix Layers: Disentangling Domains for Modular
Language Modeling”. In: Proceedings of the 2022 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies. NAACL ’22. Association for Computational
Linguistics, pp. 5557–5576 (cit. on p. 145).
Ho, Qirong, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip
Gibbons, Garth Gibson, Gregory Ganger, and Eric Xing (2013). “More
Effective Distributed ML via a Stale Synchronous Parallel Parameter
Server”. In: Advances in Neural Information Processing Systems. NeurIPS
’13. Curran Associates, pp. 1223–1231 (cit. on pp. 2, 14, 16, 18, 19, 25,
43, 47, 58, 64, 69, 73, 77, 101, 103).
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Mem-
ory”. In: Neural Computation 9.8, pp. 1735–1780 (cit. on p. 1).
Howard, Jeremy and Sebastian Ruder (2018). “Universal Language Model
Fine-tuning for Text Classification”. In: Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics. ACL ’18. Associ-
ation for Computational Linguistics, pp. 328–339 (cit. on pp. 2, 9, 10,
66).
Hu, Yifan, Yehuda Koren, and Chris Volinsky (2008). “Collaborative Filter-
ing for Implicit Feedback Datasets”. In: 8th IEEE International Conference
on Data Mining. ICDM ’08. IEEE Computer Society, pp. 263–272 (cit. on
pp. 2, 9).
Huang, Yanping, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen,
Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu,
and Zhifeng Chen (2019). “GPipe: Efficient Training of Giant Neural
Networks using Pipeline Parallelism”. In: Advances in Neural Information
Processing Systems. Vol. 32. NeurIPS ’19. Curran Associates (cit. on
p. 23).
Huang, Yuzhen, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Fan Yang,
Jinfeng Li, Yuying Guo, and James Cheng (2018). “FlexPS: Flexible
Parallelism Control in Parameter Server Architecture”. In: PVLDB 11.5,
pp. 566–579 (cit. on pp. 2, 14, 18, 25, 29, 60, 69, 77).
Huang, Yuzhen, Xiaohan Wei, Xing Wang, Jiyan Yang, Bor-Yiing Su, Shivam
Bharuka, Dhruv Choudhary, Zewei Jiang, Hai Zheng, and Jack Langman
(2021). “Hierarchical Training: Scaling Deep Recommendation Models
on Large CPU Clusters”. In: Proceedings of the 27th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. KDD
’21. Association for Computing Machinery, pp. 3050–3058 (cit. on p. 23).
Hubara, Itay, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio (2017). “Quantized Neural Networks: Training Neural
Networks with Low Precision Weights and Activations”. In: Journal of
Machine Learning Research 18.1, pp. 6869–6898 (cit. on p. 22).
Huo, Zhouyuan, Bin Gu, and Heng Huang (2021). “Large Batch Optimiza-
tion for Deep Learning Using New Complete Layer-Wise Adaptive Rate
Scaling”. In: Proceedings of the 35th AAAI Conference on Artificial Intelli-
gence. AAAI ’21 35.9, pp. 7883–7890 (cit. on p. 23).
Hutto, Phillip and Mustaque Ahamad (1990). “Slow memory: weakening
consistency to enhance concurrency in distributed shared memories”. In:
Proceedings of the 10th International Conference on Distributed Computing
Systems. ICDCS ’90, pp. 302–309 (cit. on p. 43).
Jagerman, Rolf, Carsten Eickhoff, and Maarten de Rijke (2017). “Computing
Web-scale Topic Models Using an Asynchronous Parameter Server”. In:
Proceedings of the 40th International ACM SIGIR Conference on Research
and Development in Information Retrieval. SIGIR ’17. Association for
Computing Machinery, pp. 1337–1340 (cit. on pp. 2, 14, 25, 29).
Ji, S., N. Satish, S. Li, and P. K. Dubey (2019). “Parallelizing Word2Vec in
Shared and Distributed Memory”. In: IEEE Transactions on Parallel and
Distributed Systems 30.9, pp. 2090–2100 (cit. on pp. 64, 73, 76, 77).
Jiang, Jiawei, Bin Cui, Ce Zhang, and Lele Yu (2017). “Heterogeneity-Aware
Distributed Parameter Servers”. In: Proceedings of the 2017 ACM Interna-
tional Conference on Management of Data. SIGMOD ’17. Association for
Computing Machinery, pp. 463–478 (cit. on pp. 2, 14, 18, 24, 25, 29, 69).
Jiang, Yimin, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong
Guo (2020). “A Unified Architecture for Accelerating Distributed DNN
Training in Heterogeneous GPU/CPU Clusters”. In: Proceedings of the
14th Conference on Operating Systems Design and Implementation. OSDI
’20. USENIX Association, pp. 463–479 (cit. on pp. 2, 14, 17, 23–25).
Kalia, Anuj, Michael Kaminsky, and David G. Andersen (2016). “Design
Guidelines for High Performance RDMA Systems”. In: Proceedings of
the 2016 USENIX Annual Technical Conference. USENIX ’16. USENIX
Association, pp. 437–450 (cit. on p. 24).
Kazemi, Seyed Mehran and David Poole (2018). “SimplE Embedding for Link
Prediction in Knowledge Graphs”. In: Advances in Neural Information
Processing Systems. NeurIPS ’18. Curran Associates (cit. on pp. 2, 9).
Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyan-
skiy, and Ping Tak Peter Tang (2017). “On Large-Batch Training for Deep
Learning: Generalization Gap and Sharp Minima”. In: Proceedings of
the 5th International Conference on Learning Representations. ICLR ’17
(cit. on p. 22).
Kim, Jin Kyu, Abutalib Aghayev, Garth Gibson, and Eric Xing (2019).
“STRADS-AP: Simplifying Distributed Machine Learning Programming
without Introducing a New Programming Model”. In: Proceedings of
the 2019 USENIX Annual Technical Conference. USENIX ’19. USENIX
Association, pp. 207–222 (cit. on p. 14).
Kim, Jin Kyu, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth Gibson,
and Eric Xing (2016). “STRADS: A Distributed Framework for Scheduled
Model Parallel Machine Learning”. In: Proceedings of the 11th European
Conference on Computer Systems. EuroSys ’16. Association for Computing
Machinery (cit. on pp. 2, 13).
Kim, Soojeong, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong,
Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, and Byung-Gon Chun (2019).
“Parallax: Sparsity-Aware Data Parallel Training of Deep Neural Net-
works”. In: Proceedings of the 14th European Conference on Computer
Systems. EuroSys ’19. Association for Computing Machinery (cit. on
pp. 70, 106, 107).
Kochsiek, Adrian and Rainer Gemulla (2021). “Parallel Training of Knowl-
edge Graph Embedding Models: A Comparison of Techniques”. In:
PVLDB 15.3, pp. 633–645 (cit. on pp. 76, 84).
Koren, Yehuda, Robert Bell, and Chris Volinsky (2009). “Matrix Factoriza-
tion Techniques for Recommender Systems”. In: Computer 42.8, pp. 30–
37 (cit. on pp. 1, 2, 9, 48, 63, 65).
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012). “ImageNet
Classification with Deep Convolutional Neural Networks”. In: Advances
in Neural Information Processing Systems. Vol. 25. NeurIPS ’12. Curran
Associates (cit. on p. 1).
Lamport, Leslie (1979). “How to Make a Multiprocessor Computer That
Correctly Executes Multiprocess Programs”. In: IEEE Transactions on
Computers 28.9, pp. 690–691 (cit. on pp. 16, 43).
LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard,
and L. D. Jackel (1989). “Backpropagation Applied to Handwritten Zip
Code Recognition”. In: Neural Computation 1.4, pp. 541–551 (cit. on
pp. 1, 10, 17, 66).
Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan
Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen
(2021). “GShard: Scaling Giant Models with Conditional Computation
and Automatic Sharding”. In: Proceedings of the 9th International Confer-
ence on Learning Representations. ICLR ’21 (cit. on p. 145).
Lerer, Adam, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt,
Abhijit Bose, and Alex Peysakhovich (2019). “Pytorch-BigGraph: A Large
Scale Graph Embedding System”. In: Proceedings of Machine Learning
and Systems. Vol. 1. MLSys ’19, pp. 120–131 (cit. on pp. 2, 13, 27, 30, 64,
73, 76, 77, 99).
Li, Ang, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R.
Tallent, and Kevin J. Barker (2020). “Evaluating Modern GPU Intercon-
nect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect”. In: IEEE
Transactions on Parallel and Distributed Systems 31.1, pp. 94–110 (cit. on
p. 24).
Li, Margaret, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff,
Noah A. Smith, and Luke Zettlemoyer (2022). “Branch-Train-Merge:
Embarrassingly Parallel Training of Expert Language Models”. In: CoRR
abs/2208.03306 (cit. on p. 145).
Li, Mu, David Andersen, Jun Woo Park, Alexander Smola, Amr Ahmed,
Vanja Josifovski, James Long, Eugene Shekita, and Bor-Yiing Su (2014a).
“Scaling Distributed Machine Learning with the Parameter Server”. In:
Proceedings of the 11th Conference on Operating Systems Design and Imple-
mentation. OSDI ’14. USENIX Association, pp. 583–598 (cit. on pp. 2,
14, 15, 25, 29, 34, 35, 43, 47, 51, 68, 85, 99, 125, 126).
Li, Mu, David Andersen, Alexander Smola, and Kai Yu (2014b). “Com-
munication Efficient Distributed Machine Learning with the Parameter
Server”. In: Advances in Neural Information Processing Systems. NeurIPS
’14. MIT Press, pp. 19–27 (cit. on pp. 22, 25).
Li, Shen, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis,
Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania,
and Soumith Chintala (2020). “PyTorch Distributed: Experiences on
Accelerating Data Parallel Training”. In: PVLDB 13.12, pp. 3005–3018
(cit. on pp. 17, 18, 25).
Lian, Xiangru, Yijun Huang, Yuncheng Li, and Ji Liu (2015). “Asynchronous
Parallel Stochastic Gradient for Nonconvex Optimization”. In: Advances
in Neural Information Processing Systems. NeurIPS ’15. MIT Press,
pp. 2737–2745 (cit. on p. 73).
Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji
Liu (2017). “Can Decentralized Algorithms Outperform Centralized
Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient
Descent”. In: Advances in Neural Information Processing Systems. Vol. 30.
NeurIPS ’17. Curran Associates (cit. on pp. 24, 25).
Lin, Yujun, Song Han, Huizi Mao, Yu Wang, and Bill Dally (2018). “Deep
Gradient Compression: Reducing the Communication Bandwidth for
Distributed Training”. In: Proceedings of the 6th International Conference
on Learning Representations. ICLR ’18 (cit. on p. 22).
Lipton, Richard J and Jonathan S Sandberg (1988). PRAM: A scalable shared
memory. Tech. rep. Princeton University, Department of Computer
Science (cit. on p. 43).
Liu, Hanxiao, Yuexin Wu, and Yiming Yang (2017). “Analogical Inference for
Multi-relational Embeddings”. In: Proceedings of the 34th International
Conference on Machine Learning. ICML ’17. PMLR, pp. 2168–2178
(cit. on pp. 49, 56, 64, 67, 83, 99).
Liu, Xiaodong, Yelong Shen, Kevin Duh, and Jianfeng Gao (2018). “Stochastic
Answer Networks for Machine Reading Comprehension”. In: Proceedings
of the 56th Annual Meeting of the Association for Computational Linguistics.
ACL ’18. Association for Computational Linguistics, pp. 1694–1704
(cit. on p. 50).
Low, Yucheng, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo
Kyrola, and Joseph Hellerstein (2012). “Distributed GraphLab: A Frame-
work for Machine Learning and Data Mining in the Cloud”. In: PVLDB
5.8, pp. 716–727 (cit. on pp. 25, 27, 28, 70).
Makari, Faraz, Christina Teflioudi, Rainer Gemulla, Peter Haas, and Yannis
Sismanis (2015). “Shared-memory and shared-nothing stochastic gra-
dient descent algorithms for matrix completion”. In: Knowledge and
Information Systems 42.3, pp. 493–523 (cit. on pp. 48, 84, 89).
Malewicz, Grzegorz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski (2010). “Pregel: A System for
Large-scale Graph Processing”. In: Proceedings of the 2010 ACM SIG-
MOD International Conference on Management of Data. SIGMOD ’10.
Association for Computing Machinery, pp. 135–146 (cit. on p. 25).
Masters, Dominic and Carlo Luschi (2018). “Revisiting Small Batch Training
for Deep Neural Networks”. In: CoRR abs/1804.07612 (cit. on p. 22).
Meka, Raghu, Prateek Jain, and Inderjit Dhillon (2009). “Matrix Completion
from Power-Law Distributed Samples”. In: Advances in Neural Informa-
tion Processing Systems. NeurIPS ’09. Curran Associates, pp. 1258–1266
(cit. on pp. 48, 63, 65).
Miao, Xupeng, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao
Ma, and Bin Cui (2021). “Heterogeneity-Aware Distributed Machine
Learning Training via Partial Reduce”. In: Proceedings of the 2021 Interna-
tional Conference on Management of Data. SIGMOD ’21. Association for
Computing Machinery, pp. 2262–2270 (cit. on p. 24).
Miao, Xupeng, Yining Shi, Hailin Zhang, Xin Zhang, Xiaonan Nie, Zhi
Yang, and Bin Cui (2022). “HET-GMP: A Graph-Based System Approach
to Scaling Large Embedding Model Training”. In: Proceedings of the
2022 International Conference on Management of Data. SIGMOD ’22.
Association for Computing Machinery, pp. 470–480 (cit. on p. 23).
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean (2013). “Efficient
Estimation of Word Representations in Vector Space”. In: Proceedings of
the 1st International Conference on Learning Representations. ICLR ’13
(cit. on pp. 1, 2, 9, 12, 50, 51, 56, 63–65, 67, 68, 74, 77, 84, 99).
Min, Seung Won, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong,
Eiman Ebrahimi, Deming Chen, and Wen-mei Hwu (2021). “Large
Graph Convolutional Network Training with GPU-Oriented Data Com-
munication Architecture”. In: PVLDB 14.11, pp. 2087–2100 (cit. on
p. 23).
Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill (cit. on p. 7).
Mohamed, Aisha, Shameem Parambath, Zoi Kaoudi, and Ashraf Aboulnaga
(2020). “Popularity Agnostic Evaluation of Knowledge Graph Embed-
dings”. In: Proceedings of the 36th Conference on Uncertainty in Artificial
Intelligence. UAI ’20. PMLR, pp. 1059–1068 (cit. on p. 65).
Moreno-Sánchez, Isabel, Francesc Font-Clos, and Álvaro Corral (2016). “Large-
Scale Analysis of Zipf’s Law in English Texts”. In: PLoS ONE 11.1
(cit. on pp. 63, 65).
Nakandala, Supun, Yuhao Zhang, and Arun Kumar (2019). “Cerebro: Effi-
cient and Reproducible Model Selection on Deep Learning Systems”. In:
Proceedings of the 3rd International Workshop on Data Management for
End-to-End Machine Learning. DEEM ’19. Association for Computing
Machinery, 6:1–6:4 (cit. on pp. 27, 30).
Narayanan, Deepak, Aaron Harlap, Amar Phanishayee, Vivek Seshadri,
Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei
Zaharia (2019). “PipeDream: Generalized Pipeline Parallelism for DNN
Training”. In: Proceedings of the 27th ACM Symposium on Operating Sys-
tems Principles. SOSP ’19. Association for Computing Machinery, pp. 1–
15 (cit. on p. 23).
Németh, Gábor, Dániel Géhberger, and Péter Mátray (2017). “DAL: A
Locality-Optimizing Distributed Shared Memory System”. In: 9th USE-
NIX Workshop on Hot Topics in Cloud Computing (HotCloud 17). Hot-
Cloud ’17. USENIX Association (cit. on p. 60).
Nickel, Maximilian, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich
(2016a). “A Review of Relational Machine Learning for Knowledge
Graphs”. In: Proceedings of the IEEE 104.1, pp. 11–33 (cit. on p. 49).
Nickel, Maximilian, Lorenzo Rosasco, and Tomaso Poggio (2016b). “Holo-
graphic Embeddings of Knowledge Graphs”. In: Proceedings of the 30th
AAAI Conference on Artificial Intelligence. AAAI ’16. AAAI Press,
pp. 1955–1961 (cit. on p. 49).
Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel (2011). “A Three-
way Model for Collective Learning on Multi-relational Data”. In: Pro-
ceedings of the 28th International Conference on Machine Learning. ICML
’11. Omnipress, pp. 809–816 (cit. on pp. 2, 9, 49, 63, 65).
Niu, Feng, Benjamin Recht, Christopher Re, and Stephen Wright (2011).
“HOGWILD!: A Lock-free Approach to Parallelizing Stochastic Gradient
Descent”. In: Advances in Neural Information Processing Systems. NeurIPS
’11. Curran Associates, pp. 693–701 (cit. on pp. 11, 12, 40).
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio (2013). “On the Diffi-
culty of Training Recurrent Neural Networks”. In: Proceedings of the 30th
International Conference on Machine Learning. ICML ’13. JMLR.org,
pp. 1310–1318 (cit. on p. 85).
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury,
Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito,
Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu
Fang, Junjie Bai, and Soumith Chintala (2019). “PyTorch: An Imperative
Style, High-Performance Deep Learning Library”. In: Advances in Neural
Information Processing Systems. NeurIPS ’19. Curran Associates (cit. on
pp. 2, 9, 25, 108, 144).
Patterson, David A. (2004). “Latency Lags Bandwith”. In: Communications
of the ACM 47.10, pp. 71–75 (cit. on p. 24).
Peng, Bo, Bingjing Zhang, Langshi Chen, Mihai Avram, Robert Henschel,
Craig Stewart, Shaojuan Zhu, Emily Mccallum, Lisa Smith, Tom Zahniser,
et al. (2017). “HarpLDA+: Optimizing latent dirichlet allocation for
parallel efficiency”. In: 2017 IEEE International Conference on Big Data.
BigData ’17. IEEE Computer Society, pp. 243–252 (cit. on pp. 27, 30).
Peng, Jingshu, Zhao Chen, Yingxia Shao, Yanyan Shen, Lei Chen, and Jian-
nong Cao (2022). “Sancus: Staleness-Aware Communication-Avoiding
Full-Graph Decentralized Training in Large-Scale Graph Neural Net-
works”. In: PVLDB 15.9, pp. 1937–1950 (cit. on p. 23).
Pennington, Jeffrey, Richard Socher, and Christopher Manning (2014). “GloVe:
Global Vectors for Word Representation”. In: Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing. EMNLP
’14. Association for Computational Linguistics, pp. 1532–1543 (cit. on
pp. 2, 9, 50).
Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher
Clark, Kenton Lee, and Luke Zettlemoyer (2018). “Deep Contextualized
Word Representations”. In: Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies. NAACL ’18. Association for Computational
Linguistics, pp. 2227–2237 (cit. on pp. 2, 9, 10, 50, 66, 67, 144).
Raman, Parameswaran, Sriram Srinivasan, Shin Matsushima, Xinhua Zhang,
Hyokun Yun, and S.V.N. Vishwanathan (2019). “Scaling Multinomial
Logistic Regression via Hybrid Parallelism”. In: Proceedings of the 25th
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. KDD ’19. Association for Computing Machinery, pp. 1460–1470
(cit. on pp. 27, 30).
Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss,
Alec Radford, Mark Chen, and Ilya Sutskever (2021). “Zero-Shot Text-to-
Image Generation”. In: Proceedings of the 38th International Conference on
Machine Learning. Vol. 139. Proceedings of Machine Learning Research.
PMLR, pp. 8821–8831 (cit. on pp. 1, 10).
Ratnasamy, Sylvia, Paul Francis, Mark Handley, Richard Karp, and Scott
Shenker (2001). “A Scalable Content-addressable Network”. In: Proceed-
ings of the 2001 Conference on Applications, Technologies, Architectures, and
Protocols for Computer Communications. SIGCOMM ’01. Association
for Computing Machinery, pp. 161–172 (cit. on p. 46).
Rawat, Ankit Singh, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep
Jayasumana, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar (2021).
“Disentangling Sampling and Labeling Bias for Learning in Large-Output
Spaces”. In: CoRR abs/2105.05736 (cit. on p. 67).
Řehůřek, Radim and Petr Sojka (2010). “Software Framework for Topic
Modelling with Large Corpora”. In: Proceedings of the LREC
2010 Workshop on New Challenges for NLP Frameworks. LREC ’10. ELRA,
pp. 45–50 (cit. on pp. 50, 84, 99).
Rendle, Steffen, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-
Thieme (2009). “BPR: Bayesian Personalized Ranking from Implicit
Feedback”. In: Proceedings of the 25th Conference on Uncertainty in Ar-
tificial Intelligence. UAI ’09. AUAI Press, pp. 452–461 (cit. on pp. 64,
67).
Renz-Wieland, Alexander, Tobias Drobisch, Zoi Kaoudi, Rainer Gemulla,
and Volker Markl (2021). “Just Move It! Dynamic Parameter Allocation
in Action”. In: PVLDB 14.12, pp. 2707–2710 (cit. on p. 4).
Renz-Wieland, Alexander, Rainer Gemulla, Zoi Kaoudi, and Volker Markl
(2022a). “NuPS: A Parameter Server for Machine Learning with Non-
Uniform Parameter Access”. In: Proceedings of the 2022 ACM International
Conference on Management of Data. SIGMOD ’22. Association for
Computing Machinery (cit. on p. 4).
Renz-Wieland, Alexander, Rainer Gemulla, Steffen Zeuch, and Volker Markl
(2020). “Dynamic Parameter Allocation in Parameter Servers”. In:
PVLDB 13.12, pp. 1877–1890 (cit. on p. 4).
Renz-Wieland, Alexander, Andreas Kieslinger, Robert Gericke, Rainer Ge-
mulla, Zoi Kaoudi, and Volker Markl (2022b). “Good Intentions: Adap-
tive Parameter Servers via Intent Signaling”. In: CoRR abs/2206.00470
(cit. on p. 4).
Riquelme, Carlos, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodol-
phe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby
(2021). “Scaling Vision with Sparse Mixture of Experts”. In: Advances in
Neural Information Processing Systems. NeurIPS ’21. Curran Associates,
pp. 8583–8595 (cit. on pp. 1, 145).
Rowstron, Antony and Peter Druschel (2001). “Pastry: Scalable, Decentral-
ized Object Location, and Routing for Large-Scale Peer-to-Peer Systems”.
In: Proceedings of the IFIP/ACM International Conference on Distributed
Systems Platforms Heidelberg. Middleware ’01. Springer, pp. 329–350
(cit. on p. 46).
Ruder, Sebastian (2016). “An overview of gradient descent optimization
algorithms”. In: CoRR abs/1609.04747 (cit. on p. 7).
Ruffinelli, Daniel, Samuel Broscheit, and Rainer Gemulla (2020). “You
CAN Teach an Old Dog New Tricks! On Training Knowledge Graph
Embeddings”. In: Proceedings of the 8th International Conference on
Learning Representations. ICLR ’20 (cit. on pp. 49, 56, 64, 67, 74, 83, 84).
Schlichtkrull, Michael, Thomas N. Kipf, Peter Bloem, Rianne van den Berg,
Ivan Titov, and Max Welling (2018). “Modeling Relational Data with
Graph Convolutional Networks”. In: The Semantic Web. ESWC ’18.
Springer, pp. 593–607 (cit. on pp. 2, 9, 144).
Schroff, F., D. Kalenichenko, and J. Philbin (2015). “FaceNet: A unified
embedding for face recognition and clustering”. In: 2015 IEEE Conference
on Computer Vision and Pattern Recognition. CVPR ’15. IEEE Computer
Society, pp. 815–823 (cit. on pp. 64, 67).
Seide, Frank, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu (2014). “1-Bit
Stochastic Gradient Descent and Application to Data-Parallel Distributed
Training of Speech DNNs”. In: Proceedings of the 15th Annual Conference
of the International Speech Communication Association. Interspeech ’14,
pp. 1058–1062 (cit. on p. 22).
Sergeev, Alexander and Mike Del Balso (2018). “Horovod: fast and easy
distributed deep learning in TensorFlow”. In: CoRR abs/1802.05799
(cit. on pp. 2, 17, 23).
Shallue, Christopher J., Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein,
Roy Frostig, and George E. Dahl (2019). “Measuring the Effects of Data
Parallelism on Neural Network Training”. In: Journal of Machine Learning
Research 20.112, pp. 1–49 (cit. on p. 22).
Shang, Chao, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen
Zhou (2019). “End-to-End Structure-Aware Convolutional Networks
for Knowledge Base Completion”. In: Proceedings of the 33rd AAAI
Conference on Artificial Intelligence. AAAI ’19. AAAI Press, pp. 3060–
3067 (cit. on pp. 2, 9, 144).
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc
Le, Geoffrey Hinton, and Jeff Dean (2017). “Outrageously Large Neural
Networks: The Sparsely-Gated Mixture-of-Experts Layer”. In: Proceedings
of the 5th International Conference on Learning Representations. ICLR ’17
(cit. on p. 145).
Shi, Baoxu and Tim Weninger (2018). “Open-World Knowledge Graph
Completion”. In: Proceedings of the 32nd AAAI Conference on Artificial
Intelligence. AAAI ’18. AAAI Press (cit. on p. 49).
Simonyan, Karen and Andrew Zisserman (2015). “Very Deep Convolutional
Networks for Large-Scale Image Recognition”. In: Proceedings of the 3rd
International Conference on Learning Representations. ICLR ’15 (cit. on
p. 1).
Smith, Alan (1982). “Cache Memories”. In: ACM Computing Surveys 14.3,
pp. 473–530 (cit. on p. 31).
Smola, Alexander and Shravan Narayanamurthy (2010). “An Architecture
for Parallel Topic Models”. In: PVLDB 3.1-2, pp. 703–710 (cit. on pp. 2,
14, 15, 25, 28, 68).
Socher, Richard, John Bauer, Christopher Manning, and Andrew Ng (2013).
“Parsing with Compositional Vector Grammars”. In: Proceedings of the
51st Annual Meeting of the Association for Computational Linguistics. ACL
’13. Association for Computational Linguistics, pp. 455–465 (cit. on
p. 50).
Steen, Maarten van and Andrew Tanenbaum (2017). Distributed Systems. 3rd
(cit. on pp. 31, 35, 46, 121).
Stergiou, Stergios, Zygimantas Straznickas, Rolina Wu, and Kostas Tsiout-
siouliklis (2017). “Distributed Negative Sampling for Word Embeddings”.
In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI
’17. AAAI Press, pp. 2569–2575 (cit. on pp. 64, 73, 76).
Stoica, Ion, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakr-
ishnan (2001). “Chord: A Scalable Peer-to-peer Lookup Service for Inter-
net Applications”. In: Proceedings of the 2001 Conference on Applications,
Technologies, Architectures, and Protocols for Computer Communications.
SIGCOMM ’01. Association for Computing Machinery, pp. 149–160
(cit. on p. 46).
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le (2014). “Sequence to Sequence
Learning with Neural Networks”. In: Advances in Neural Information
Processing Systems. Vol. 27. NeurIPS ’14. Curran Associates (cit. on p. 1).
Tang, Hanlin, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu (2018). “D²:
Decentralized Training over Decentralized Data”. In: Proceedings of the
35th International Conference on Machine Learning. Ed. by Jennifer Dy
and Andreas Krause. Vol. 80. ICML ’18. PMLR, pp. 4848–4856 (cit. on
p. 24).
Teflioudi, Christina, Faraz Makari, and Rainer Gemulla (2012). “Distributed
Matrix Completion”. In: 12th IEEE International Conference on Data
Mining. ICDM ’12. IEEE Computer Society, pp. 655–664 (cit. on pp. 27,
30, 31, 46, 84, 98).
Thakur, Rajeev and William D. Gropp (2003). “Improving the Performance of
Collective Operations in MPICH”. In: Recent Advances in Parallel Virtual
Machine and Message Passing Interface. EuroPVM/MPI ’03. Springer,
pp. 257–267 (cit. on p. 73).
Träff, Jesper Larsson (2010). “Transparent Neutral Element Elimination in
MPI Reduction Operations”. In: Recent Advances in the Message Passing
Interface. Springer, pp. 275–284 (cit. on p. 72).
Trouillon, Théo, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guil-
laume Bouchard (2016). “Complex Embeddings for Simple Link Predic-
tion”. In: Proceedings of the 33rd International Conference on Machine
Learning. ICML ’16. JMLR.org, pp. 2071–2080 (cit. on pp. 2, 9, 49, 83).
Vashishth, Shikhar, Soumya Sanyal, Vikram Nitin, and Partha Talukdar
(2020). “Composition-based Multi-Relational Graph Convolutional Net-
works”. In: Proceedings of the 8th International Conference on Learning
Representations. ICLR ’20 (cit. on pp. 2, 9, 144).
Wang, Guanhua, Shivaram Venkataraman, Amar Phanishayee, Nikhil De-
vanur, Jorgen Thelin, and Ion Stoica (2020). “Blink: Fast and Generic
Collectives for Distributed ML”. In: Proceedings of Machine Learning and
Systems. Vol. 2. MLSys ’20, pp. 172–186 (cit. on p. 24).
Wang, Hongyi, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris
Papailiopoulos, and Stephen Wright (2018). “ATOMO: Communication-
efficient Learning via Atomic Sparsification”. In: Advances in Neural
Information Processing Systems. Vol. 31. NeurIPS ’18. Curran Associates
(cit. on p. 22).
Wang, Ruoxi, Bin Fu, Gang Fu, and Mingliang Wang (2017). “Deep & Cross
Network for Ad Click Predictions”. In: Proceedings of the ADKDD’17.
ADKDD’17. Association for Computing Machinery (cit. on pp. 2, 9,
144).
Wang, Shuai, Dan Li, Jinkun Geng, Yue Gu, and Yang Cheng (2019). “Impact
of Network Topology on the Performance of DML: Theoretical Analysis
and Practical Factors”. In: Proceedings of the 2019 IEEE International Con-
ference on Computer Communications. INFOCOM ’19. IEEE Computer
Society, pp. 1729–1737 (cit. on p. 23).
Wang, Xiaozhi, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and
Jian Tang (2019). “KEPLER: A Unified Model for Knowledge Embedding
and Pre-trained Language Representation”. In: CoRR abs/1911.06136
(cit. on p. 83).
Wen, Wei, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen,
and Hai Li (2017). “TernGrad: Ternary Gradients to Reduce Communica-
tion in Distributed Deep Learning”. In: Advances in Neural Information
Processing Systems. Vol. 30. NeurIPS ’17. Curran Associates (cit. on
p. 22).
Wolfson, Ouri and Sushil Jajodia (1992). “Distributed Algorithms for Dynamic
Replication of Data”. In: Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems. PODS
’92. Association for Computing Machinery, pp. 149–163 (cit. on p. 70).
Xing, Eric, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee,
Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu (2015).
“Petuum: A New Platform for Distributed Machine Learning on Big
Data”. In: Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. KDD ’15. Association for
Computing Machinery, pp. 1335–1344 (cit. on pp. 29, 40, 43, 51, 85, 103,
108, 122).
Yang, Bishan, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng (2015).
“Embedding Entities and Relations for Learning and Inference in
Knowledge Bases”. In: Proceedings of the 3rd International Conference on Learning
Representations. ICLR ’15 (cit. on p. 49).
Yang, Bowen, Jian Zhang, Jonathan Li, Christopher Re, Christopher Aberger,
and Christopher De Sa (2021). “PipeMare: Asynchronous Pipeline Par-
allel DNN Training”. In: Proceedings of Machine Learning and Systems.
Vol. 3. MLSys ’21, pp. 269–296 (cit. on p. 23).
Yang, Fan, Jinfeng Li, and James Cheng (2016). “Husky: Towards a More Ef-
ficient and Expressive Distributed Computing Framework”. In: PVLDB
9.5, pp. 420–431 (cit. on p. 60).
Yang, Zhen, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie
Tang (2020). “Understanding Negative Sampling in Graph Representation
Learning”. In: Proceedings of the 26th ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining. KDD ’20. Association
for Computing Machinery, pp. 1666–1676 (cit. on p. 67).
You, Yang, Igor Gitman, and Boris Ginsburg (2017a). “Scaling SGD Batch
Size to 32K for ImageNet Training”. In: CoRR abs/1708.03888 (cit. on
p. 22).
You, Yang, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh
Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui
Hsieh (2020). “Large Batch Optimization for Deep Learning: Training
BERT in 76 minutes”. In: Proceedings of the 8th International Conference
on Learning Representations. ICLR ’20 (cit. on p. 22).
You, Yang, Zhao Zhang, Cho-Jui Hsieh, and James Demmel (2017b). “100-
epoch ImageNet Training with AlexNet in 24 Minutes”. In: CoRR
abs/1709.05011 (cit. on p. 22).
Yu, Hsiang-Fu, Cho-Jui Hsieh, Hyokun Yun, S.V.N. Vishwanathan, and In-
derjit Dhillon (2015). “A Scalable Asynchronous Distributed Algorithm
for Topic Modeling”. In: Proceedings of the 24th International Confer-
ence on World Wide Web. WWW ’15. International World Wide Web
Conferences Steering Committee, pp. 1340–1350 (cit. on pp. 27, 30).
Yun, Hyokun, Hsiang-Fu Yu, Cho-Jui Hsieh, S.V.N. Vishwanathan, and In-
derjit Dhillon (2014). “NOMAD: Non-locking, Stochastic Multi-machine
Algorithm for Asynchronous and Decentralized Matrix Completion”.
In: PVLDB 7.11, pp. 975–986 (cit. on pp. 27, 29, 30, 46).
Zaharia, Matei, Mosharaf Chowdhury, Michael Franklin, Scott Shenker,
and Ion Stoica (2010). “Spark: Cluster Computing with Working Sets”.
In: Proceedings of the 2nd Conference on Hot Topics in Cloud Computing.
HotCloud ’10. USENIX Association, pp. 10–10 (cit. on p. 25).
Zhang, Dalong, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng
Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi (2020).
“AGL: A Scalable System for Industrial-Purpose Graph Machine Learn-
ing”. In: PVLDB 13.12, pp. 3125–3137 (cit. on p. 23).
Zhang, Hantian, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang
(2017). “ZipML: Training Linear Models with End-to-End Low Precision,
and a Little Bit of Deep Learning”. In: Proceedings of the 34th International
Conference on Machine Learning. Vol. 70. ICML ’17. PMLR, pp. 4035–
4043 (cit. on p. 22).
Zhang, Xin, Jia Liu, and Zhengyuan Zhu (2018). “Taming Convergence for
Asynchronous Stochastic Gradient Descent with Unbounded Delay in
Non-Convex Learning”. In: CoRR abs/1805.09470 (cit. on p. 73).
Zhang, Zhipeng, Bin Cui, Yingxia Shao, Lele Yu, Jiawei Jiang, and Xupeng
Miao (2019). “PS2: Parameter Server on Spark”. In: Proceedings of the
2019 ACM International Conference on Management of Data. SIGMOD
’19. Association for Computing Machinery, pp. 376–388 (cit. on pp. 2,
14, 25).
Zhao, Ben, Ling Huang, Jeremy Stribling, Sean Rhea, Anthony Joseph, and
John Kubiatowicz (2004). “Tapestry: A resilient global-scale overlay for
service deployment”. In: IEEE Journal on Selected Areas in Communica-
tions 22.1, pp. 41–53 (cit. on p. 46).
Zhao, Weijie, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming
Sun, and Ping Li (2020). “Distributed Hierarchical GPU Parameter Server
for Massive Scale Deep Learning Ads Systems”. In: Proceedings of Machine
Learning and Systems. Vol. 2. MLSys ’20, pp. 412–428 (cit. on p. 23).
Zheng, Da, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song,
Quan Gan, Zheng Zhang, and George Karypis (2020a). “DistDGL: Dis-
tributed Graph Neural Network Training for Billion-Scale Graphs”. In:
CoRR abs/2010.05337 (cit. on p. 23).
Zheng, Da, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao
Xiong, Zheng Zhang, and George Karypis (2020b). “DGL-KE: Training
Knowledge Graph Embeddings at Scale”. In: CoRR abs/2004.08532 (cit.
on pp. 64, 73, 76, 77, 94).
Zheng, Qiming, Quan Chen, Kaihao Bai, Huifeng Guo, Yong Gao, Xiuqiang
He, and Minyi Guo (2021). “BiPS: Hotness-aware Bi-tier Parameter Syn-
chronization for Recommendation Models”. In: 35th IEEE International
Parallel and Distributed Processing Symposium. IPDPS ’21. IEEE Computer
Society, pp. 609–618 (cit. on pp. 70, 106, 114).
Zhou, Guorui, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma,
Yanghui Yan, Junqi Jin, Han Li, and Kun Gai (2018). “Deep Interest
Network for Click-Through Rate Prediction”. In: Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. KDD ’18. Association for Computing Machinery, pp. 1059–1068
(cit. on pp. 2, 9, 144).
Zhu, Rong, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai,
Yong Li, and Jingren Zhou (2019). “AliGraph: A Comprehensive Graph
Neural Network Platform”. In: PVLDB 12.12, pp. 2094–2105 (cit. on
p. 23).
Ziegler, Tobias, Carsten Binnig, and Viktor Leis (2022). “ScaleStore: A Fast
and Cost-Efficient Storage Engine Using DRAM, NVMe, and RDMA”.
In: Proceedings of the 2022 International Conference on Management of
Data. SIGMOD ’22. Association for Computing Machinery, pp. 685–699
(cit. on p. 24).
Ziegler, Tobias, Viktor Leis, and Carsten Binnig (2020). “RDMA Communication
Patterns”. In: Datenbank-Spektrum 20, pp. 199–210 (cit. on p. 24).
Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex Smola (2010).
“Parallelized Stochastic Gradient Descent”. In: Advances in Neural Information
Processing Systems. Vol. 23. NeurIPS ’10. Curran Associates (cit. on
p. 10).
List of Figures
2.1 Pipeline parallel preparation and training of batches. . . . . . . 9
2.2 An example for sparse parameter access. . . . . . . . . . . . . . . 10
2.3 An example for dense parameter access. . . . . . . . . . . . . . . 11
2.4 The distributed architecture that we assume in this thesis. . . 14
2.5 An example for parameter allocation in a classic PS. . . . . . . 15
2.6 An example for parameter allocation with full static replication. . . 18
2.7 An example for parameter allocation in an SSP replication PS. 20
2.8 An example for parameter allocation in an ESSP replication PS. . . 21
3.1 The data clustering PAL technique. . . . . . . . . . . . . . . . . . 29
3.2 The parameter blocking PAL technique. . . . . . . . . . . . . . . 30
3.3 The latency hiding PAL technique. . . . . . . . . . . . . . . . . . . 32
3.4 PS architecture with server and worker threads co-located in one process per node. . . 34
3.5 Parameter allocation in Lapse for three example workloads with different PAL techniques. . . 37
3.6 A worker requests to localize a parameter. . . . . . . . . . . . . . 39
3.7 Routing for non-local parameter access. . . . . . . . . . . . . . . 41
3.8 Performance for matrix factorization. . . . . . . . . . . . . . . . 52
3.9 Performance for training knowledge graph embeddings. . . . 53
3.10 Performance for training word embeddings. . . . . . . . . . . . 55
3.11 Performance comparison to manual parameter management and to Petuum for matrix factorization. . . 57
4.1 NuPS components. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Number of accesses per parameter in one epoch. . . . . . . . . 66
4.3 Parameter management in NuPS. . . . . . . . . . . . . . . . . . . 72
4.4 Sampling scheme implementations in NuPS. . . . . . . . . . . . 80
4.5 End-to-end performance of different PSs on 8 nodes. . . . . . . 88
4.6 Ablation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Strong scaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.8 Effective scalability. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.9 Performance of different sampling access management techniques. . . 93
4.10 Impact of the management technique on epoch run time and model quality. . . 95
4.11 Effect of replica staleness on epoch run time and model quality. . . 97
5.1 Parameters held by different nodes at different times in common parameter management approaches. . . 104
5.2 Example intent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 AdaPS architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4 AdaPS decides automatically whether to relocate or replicate a parameter at any time t. . . 112
5.5 Examples for parameter management in AdaPS. . . . . . . . . 113
5.6 AdaPS relocates a parameter only when there is exactly one node with intent. . . 115
5.7 AdaPS learns automatically when to act on an intent signal. . 117
5.8 Network communication of different approaches for placing management responsibility. . . 120
5.9 Performance of AdaPS and existing PSs. . . . . . . . . . . . . . . 128
5.10 Strong scaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.11 Network communication during one epoch. . . . . . . . . . . . 131
5.12 Performance of AdaPS and ablation variants. . . . . . . . . . . . 133
5.13 The effect of automatic action timing on epoch run time and model quality. . . 135
5.14 Parameter management for selected parameters in the KGE task. . . 137
5.15 Parameter management for selected parameters in the WE task. . . 138
5.16 Parameter management for selected parameters in the MF task. . . 139
List of Tables
3.1 Primitives of Lapse. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Per-key consistency guarantees of PS architectures. . . . . . . . 43
3.3 Location management strategies. . . . . . . . . . . . . . . . . . . . 45
3.4 ML tasks, models, and datasets. . . . . . . . . . . . . . . . . . . . . 49
3.5 Parameter reads, relocations, and relocation times in ComplEx-Large. . . 54
4.1 Conformity levels of common sampling schemes. . . . . . . . . 76
4.2 ML tasks, models, and datasets. . . . . . . . . . . . . . . . . . . . . 83
4.3 Share of direct and sampling access for each ML task. . . . . . 84
4.4 Share of replicated keys, replica size, and share of accesses to replicas for different extents of replication. . . 98
5.1 Approaches to distributed parameter management: adaptivity, ease of use, and efficiency for sparse workloads. . . 105
5.2 ML tasks, models, and datasets. . . . . . . . . . . . . . . . . . . . . 125
List of Algorithms
2.1 Sequential mini-batch SGD. . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Asynchronous parallel mini-batch SGD. . . . . . . . . . . . . . . 12
2.3 Distributed asynchronous SGD with a classic PS. . . . . . . . . 17
2.4 Distributed asynchronous SGD with Petuum. . . . . . . . . . . . 19
3.1 Distributed asynchronous SGD with Lapse. . . . . . . . . . . . . 36
4.1 Distributed asynchronous SGD with NuPS. . . . . . . . . . . . . 71
4.2 Sampling support in distributed asynchronous SGD. . . . . . . 79
5.1 Distributed asynchronous SGD with intent signaling. . . . . . . 110
5.2 Automatic action timing in AdaPS. . . . . . . . . . . . . . . . . . . 119