Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, pages 146–150
Florence, Italy, August 2, 2019. © 2019 Association for Computational Linguistics
Times Are Changing: Investigating the Pace of Language Change
in Diachronic Word Embeddings
Stephanie Brandl, David Lassner
Machine Learning Group, TU Berlin, Berlin, Germany
{stephanie.brandl, lassner}@tu-berlin.de
Abstract
We propose Word Embedding Networks
(WEN), a novel method that is able to learn
word embeddings of individual data slices
while simultaneously aligning and ordering
them without feeding temporal information a
priori to the model. This gives us the opportunity
to analyse the dynamics in word embeddings
on a large scale in a purely data-driven
manner. In experiments on two different newspaper
corpora, the New York Times (English)
and Die Zeit (German), we were able to show
that time actually determines the dynamics of
semantic change. However, we find that the
evolution does not happen uniformly; instead,
we discover times of faster and times of
slower change.
1 Introduction
Vectorial representations of natural language,
known as word embeddings, have been widely
used in, e.g., text classification (Joulin et al., 2016)
and machine translation (Mikolov et al., 2013).
As in Kim et al. (2014), Kulkarni et al. (2015),
Hamilton et al. (2016) and Szymanski (2017),
aligned sets of embeddings have also been used
to detect changes in vectorial representations of
words over time. In the past, those changes have
mostly been studied at the word level.
We propose a novel method to investigate the pace
of language change based on the entire embedding
matrix. Previous approaches have not been able to
carry out this type of analysis, as they have taken
the continuous change of language for granted
and investigated those dynamics in a supervised
manner.
We therefore present Word Embedding Networks
(WEN), a method that has no knowledge
about the chronological order of the slices, so we
can investigate semantic changes on the whole
vocabulary in a purely data-driven and unsupervised manner.
Pairwise relations between embeddings and the
embeddings themselves are learned simultaneously
without feeding the temporal information a
priori into the algorithm. In that respect, it is substantially
more flexible than the methods mentioned
above. This means that dynamics between any
slicing of a text corpus can be learned (especially
those where no order is known), and the
result contains not only embeddings for each
slice but also an order of slices that corresponds to the
dynamics of word meanings.
This method also removes the need for a two-step
solution for aligned temporal embeddings,
as does the method of Yao et al. (2018). According
to Yao et al. (2018), the two-step solution has
weaknesses, especially when the amounts of data
are distributed non-uniformly across
the slices. Closer proximity between embeddings
denotes time intervals of slower semantic
change; embeddings are farther apart when times
are changing faster.
2 Related Work
Rudolph and Blei (2018) analyse dynamical
changes in word embeddings based on exponential
family embeddings, a probabilistic framework
that generalizes the concept of word embeddings
to other types of data (Rudolph et al., 2016). They
focus on word-level changes within and between
text corpora spanning from the 19th century until
today.
Yao et al. (2018) proposed a new
method to learn individual word embeddings for
each year of the New York Times data set (1990-2016)
while simultaneously aligning the embeddings
to the same vector space. Their neighborhood
constraint
$$\frac{\tau}{2}\left(\lVert U_{t-1} - U_t\rVert_F^2 + \lVert U_t - U_{t+1}\rVert_F^2\right)$$
encourages alignment of the word embeddings.
The parameter $\tau$ controls the dynamics, i.e. how
much neighboring word embeddings are allowed
to differ ($\tau = 0$: no alignment; $\tau \to \infty$: static
embeddings).
Figure 1: Comparison of Dynamic Word Embeddings (Yao et al., 2018) and WEN, which can be seen as a generalization of the former.
(a) Dynamic Word Embeddings from Yao et al. (2018) have a predefined ordering of the embeddings $U_\cdot$.
(b) WEN learns the embeddings $U_\cdot$ and the weights $\omega_{\cdot,\cdot}$ in turn. Thicker edges denote stronger relations between embeddings.
3 Method
To identify the pace of change, we introduce a
new method named Word Embedding Networks
(WEN). WEN learns embeddings for, e.g., different
time slices while simultaneously aligning and ordering
them. WEN starts by assuming an equal
distance between all embeddings and then, over
the course of training, shapes the relations by moving certain
embeddings closer together and others farther apart. In Fig. 1
we illustrate an exemplary trained word embedding
network and compare it to Dynamic Word
Embeddings from Yao et al. (2018).
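In order to train the weights of the graph, we include
an additional weighting term $\omega_{t,t'}$ in the
model and optimize over
$$
\begin{aligned}
\min_{U_t} F_t = \min_{U_t}\; & \frac{1}{2}\left\lVert Y_t - U_t U_t^\top\right\rVert_F^2 && (1)\\
& + \frac{\lambda}{2}\lVert U_t\rVert_F^2 && (2)\\
& + \frac{\tau}{2}\sum_{t' \neq t}^{N} \omega_{t,t'}\lVert U_t - U_{t'}\rVert_F^2. && (3)
\end{aligned}
$$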
Here, $U_t \in \mathbb{R}^{V \times D}$ contains the $D$-dimensional
word embeddings for a vocabulary of size $V$ at time
point $t$, and $Y_t \in \mathbb{R}^{V \times V}$ represents the PPMI matrix
(Yao et al., 2018).
While Term 1 is responsible for training the word
embeddings with respect to $Y_t$, Term 2 enforces
sparse vectorial representations.
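The three terms of the objective can be evaluated directly from their definitions. The following is a minimal NumPy sketch and not the authors' PyTorch implementation; the function name `wen_objective`, the array layout (one stacked array per quantity) and the default values of `lam` and `tau` (taken from Sec. 4.2) are assumptions made for illustration.

```python
import numpy as np

def wen_objective(t, U, Y, omega, lam=10.0, tau=50.0):
    """Evaluate F_t for slice t (hypothetical helper, not from the paper).

    U:     array of shape (N, V, D), one embedding matrix per slice
    Y:     array of shape (N, V, V), one PPMI matrix per slice
    omega: array of shape (N, N), pairwise slice weights
    """
    U_t = U[t]
    # Term 1: how well U_t U_t^T reconstructs the PPMI matrix Y_t.
    fit = 0.5 * np.linalg.norm(Y[t] - U_t @ U_t.T, "fro") ** 2
    # Term 2: squared Frobenius-norm penalty on the embeddings.
    reg = 0.5 * lam * np.linalg.norm(U_t, "fro") ** 2
    # Term 3: weighted squared distances to every other slice t' != t.
    align = 0.0
    for s in range(U.shape[0]):
        if s != t:
            align += omega[t, s] * np.linalg.norm(U_t - U[s], "fro") ** 2
    return fit + reg + 0.5 * tau * align
```

With all weights equal, Term 3 pulls every slice towards every other slice equally; the ordering only emerges once the weights start to differ.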
Updating $\omega_{t,t'}$ with respect to the distances between
the word embeddings of different slices is
meant to strengthen the connections between word embeddings
that lie closer together in the corresponding
vector space.
To update $\omega_{t,t'}$, we first introduce a symmetric normalization
function
$$\operatorname{norm}_{\mathrm{sym}}(x_{ij}) = \frac{x_{ij}}{\sum_k x_{ik} + \sum_k x_{kj}},$$
where $x_{ij} \in \mathbb{R}$.
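The weighting term $\omega_{t,t'}$ is then updated accordingly:
$$
\begin{aligned}
d_{t,t'} &= \operatorname{norm}_{\mathrm{sym}}\left(\frac{1}{\lVert U_t - U_{t'}\rVert_F^2}\right)\\
\omega_{t,t'}^{\mathrm{new}} &= \operatorname{norm}_{\mathrm{sym}}\left(\omega_{t,t'} + d_{t,t'}\right). \qquad (4)
\end{aligned}
$$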
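The weight update of Eq. 4 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: the denominator of the normalization is read as the sum of row $i$ plus the sum of column $j$, self-distances (the diagonal) are skipped, and a small `eps` is added to avoid division by zero. The names `norm_sym` and `update_omega` are chosen here for readability.

```python
import numpy as np

def norm_sym(X):
    """Divide each entry x_ij by (sum of row i + sum of column j)."""
    row = X.sum(axis=1, keepdims=True)   # shape (N, 1)
    col = X.sum(axis=0, keepdims=True)   # shape (1, N)
    return X / (row + col)

def update_omega(omega, U, eps=1e-12):
    """One weight update in the spirit of Eq. 4 (assumed details, see above).

    omega: array of shape (N, N), current slice weights
    U:     array of shape (N, V, D), current embeddings
    """
    N = U.shape[0]
    inv_dist = np.zeros((N, N))
    for t in range(N):
        for s in range(N):
            if s != t:
                d = np.linalg.norm(U[t] - U[s], "fro") ** 2
                inv_dist[t, s] = 1.0 / (d + eps)   # eps guards against d = 0
    # Closer slices have larger inverse distances and gain weight.
    return norm_sym(omega + norm_sym(inv_dist))
```

Because the inverse distances are symmetric, a symmetric `omega` stays symmetric after the update, and pairs of slices that lie closer together end up with larger weights.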
We optimize $U$ with gradient descent, using
Adam (Kingma and Ba, 2014) with default
values for $\beta$ and a customized learning rate
(see Sec. 4.2).
We did not tune for efficiency and stopped the optimization
after 1500 rounds, where one round is
finished when the embeddings of all time slices have
been updated once in a random order.
We initialize $\omega_{t,t'} = 1$ for all $t, t'$ and update the weights every 100
rounds according to Eq. 4.
We have implemented this in PyTorch.
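The round structure described above (every slice updated once per round in random order, weights refreshed at fixed intervals) can be sketched as follows. Several simplifications are assumed: plain gradient descent replaces Adam, NumPy replaces PyTorch, the gradient of $F_t$ is written out by hand with all other slices held fixed, and the weight update is passed in as a callable. The names `grad_F_t` and `train` are hypothetical.

```python
import numpy as np

def grad_F_t(t, U, Y, omega, lam=10.0, tau=50.0):
    """Gradient of F_t with respect to U_t, other slices held fixed."""
    U_t = U[t]
    R = Y[t] - U_t @ U_t.T
    grad = -(R + R.T) @ U_t               # from Term 1
    grad = grad + lam * U_t               # from Term 2
    for s in range(U.shape[0]):           # from Term 3
        if s != t:
            grad = grad + tau * omega[t, s] * (U_t - U[s])
    return grad

def train(U, Y, omega, rounds, lr, update_omega=None, update_every=100, seed=0):
    """Run `rounds` rounds; each round visits every slice once in random order.

    update_omega: optional callable (omega, U) -> new omega, applied every
    `update_every` rounds. Plain gradient steps stand in for Adam here.
    """
    rng = np.random.default_rng(seed)
    U = U.copy()
    for r in range(1, rounds + 1):
        for t in rng.permutation(U.shape[0]):
            U[t] -= lr * grad_F_t(t, U, Y, omega)
        if update_omega is not None and r % update_every == 0:
            omega = update_omega(omega, U)
    return U, omega
```

In a PyTorch version the hand-written gradient would be replaced by automatic differentiation, which also makes swapping in Adam straightforward.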
4 Training
4.1 Data Sets
New York Times 1990-2016:
The New York Times data set¹ (NYT) contains
headlines and lead texts of news articles published
online and offline between 1990 and 2016 with a
total of 99,872 documents.
¹ https://sites.google.com/site/zijunyaorutgers/
Die Zeit 1947-2017:
Die Zeit is a German national weekly newspaper
that started publishing in 1946. We obtained titles,
teaser titles and teaser texts of 508,698 news articles
from the Die Zeit developer API² that were
published online and offline between 1947
and 2017.
Figure 2: Laplacian eigenmaps of the w matrix to visualize the relationship between embeddings (pace of change).
Points are colored with ground truth information.
(a) 2-dimensional embedding of the New York Times w matrix.
Slices are sorted nicely in a circular structure with only a few
exceptions.
(b) 2-dimensional embedding of the Die Zeit w matrix. The
embedding of the embeddings still resembles the chronology, but there
is also a secondary structure of three clusters.
4.2 Parameters
We perform a grid search to select optimal parameters
on the first half (1990-2002) of NYT, which
results in the same parameter combination as reported
by Yao et al. (2018) ($\lambda = 10$, $\tau = 50$, embedding
dimension $= 32$). We start with a learning
rate of $\eta = 10^{-3}$, reducing it after 500 rounds
to $\eta_{500} = 5 \cdot 10^{-4}$ and after 1000 rounds to
$\eta_{1000} = 10^{-4}$.
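The step schedule for the learning rate amounts to a small lookup. The sketch below assumes that rounds are counted from 0 and that the reduced rate applies from round 500 (respectively 1000) onward; the text does not specify the exact boundary, and the function name `learning_rate` is illustrative.

```python
def learning_rate(round_index):
    """Step schedule from Sec. 4.2; round_index counts completed rounds from 0."""
    if round_index < 500:
        return 1e-3
    if round_index < 1000:
        return 5e-4
    return 1e-4
```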
4.3 Preprocessing
We lemmatize the data with spaCy³.
We only consider the 20,000 most frequent (lemmatized)
words of the entire data set that are also
among the 20,000 most frequent words in at least
3 yearly slices. This way, we filter out “trend”
words that are only of significance within a very
short time period. The 100 most frequent words
are filtered out as stop words.
² http://developer.zeit.de
³ https://spacy.io
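The vocabulary filter can be sketched with the standard library alone. The order of operations is an assumption, since the text does not fix it: words are ranked by overall frequency, the most frequent ones are dropped as stop words, and the remainder is kept only if it is among the top words of enough slices. The function name `build_vocabulary` and its parameter names are illustrative, and the input is assumed to be already lemmatized.

```python
from collections import Counter

def build_vocabulary(slices, top_k=20000, min_slices=3, n_stop=100):
    """slices: list of token lists, one list of lemmas per yearly slice."""
    total = Counter()
    slice_hits = Counter()
    for tokens in slices:
        counts = Counter(tokens)
        total.update(counts)
        # Count in how many slices a word is among the top_k most frequent.
        slice_hits.update(word for word, _ in counts.most_common(top_k))
    ranked = [word for word, _ in total.most_common(top_k)]
    candidates = ranked[n_stop:]   # drop the most frequent words as stop words
    return [w for w in candidates if slice_hits[w] >= min_slices]
```

A toy call such as `build_vocabulary(slices, top_k=3, min_slices=2, n_stop=1)` shows the effect on a handful of tokens: a word that is frequent in only one slice is removed even if it would otherwise qualify.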
4.4 Experiments
New York Times 2003-2016:
We apply WEN with the parameters from Section
4.2 to the second part of NYT (2003-2016) and
train embeddings for yearly slices (14 in total). In
most cases, namely 85%, the embedding that WEN
places closest to a given slice belongs to
one of its chronological neighbors. We define
this share as the (neighborhood) accuracy.
Die Zeit 1947-2017:
We further apply WEN to the Die Zeit data set
(1947-2017) to evaluate the model on a non-English
text corpus that was not involved
in the parameter search. We train and sort the word
embeddings on the entire data set (71 yearly
slices). After 1500 rounds, we achieve an accuracy
of 67%.
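The neighborhood accuracy can be computed from any pairwise closeness matrix over the slices. The sketch below assumes that rows and columns are in chronological order and that larger entries mean closer slices (for instance the learned weights; negated distances would work the same way). Whether the authors measured closeness by weights or by distances is not stated, so this is one possible reading, and the name `neighborhood_accuracy` is illustrative.

```python
import numpy as np

def neighborhood_accuracy(affinity):
    """Share of slices whose closest slice is a chronological neighbor.

    affinity[t, s]: closeness of slices t and s, in chronological order.
    """
    A = np.array(affinity, dtype=float)    # copy, so the input is untouched
    np.fill_diagonal(A, -np.inf)           # a slice is not its own neighbor
    closest = A.argmax(axis=1)
    hits = np.abs(closest - np.arange(len(A))) == 1
    return hits.mean()
```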
5 Results
The high neighborhood accuracies indeed indicate
a temporal dynamic in both data sets.
To visualize the network weights as a map, w
was used as an affinity matrix to generate Laplacian
eigenmaps (Belkin and Niyogi, 2002) of the
neighborhoods between yearly slices. In Fig. 2,
maps of NYT (second half) and Die Zeit are
shown side by side.
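A Laplacian eigenmap of an affinity matrix can be obtained from the eigenvectors of the normalized graph Laplacian. The following NumPy sketch assumes a symmetric, non-negative affinity matrix in which every slice has at least one positive connection; it is a generic implementation of the technique and not necessarily the routine the authors used. The name `laplacian_eigenmap` is illustrative.

```python
import numpy as np

def laplacian_eigenmap(W, dim=2):
    """Embed the nodes of a weighted graph with affinity matrix W in `dim` dimensions."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)                  # ignore self-affinities
    W = 0.5 * (W + W.T)                       # enforce symmetry
    deg = W.sum(axis=1)                       # node degrees, assumed > 0
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Map back to solutions of L v = lambda D v and skip the trivial first one.
    return d_inv_sqrt[:, None] * eigvecs[:, 1:dim + 1]
```

For a ring-shaped affinity structure the two leading non-trivial coordinates place the nodes on a circle, which matches the kind of circular layout described for the NYT slices.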
NYT shows a very clear circular structure of neighborhoods
but also how some years lie closer together
than others. For instance, the years 2008-2010
can be found very close together, whereas
2011 has larger gaps to its two closest neighbors,
2010 and 2012. By identifying the words of
largest change within 2011, we found mostly personal
names and companies in the high ranks
whose media coverage changed during that time.
We also detected larger gaps between chronologically neighboring
years when there are shifts in how sections
are distributed in the data set. While Sports remains
the section with the most articles throughout the entire
data set, there are more documents in Opinion after
2003 and only half the documents in New York
and Region after 2005.
Regarding the Die Zeit map, we observe three distinct
clusters: one before 1995, one between 1995 and 2008, and one after 2008.
The gaps clearly correspond to changes in either
publication strategy (an emphasis on
online publication starting in 2009) or archival data storage
(there are close to no teaser texts available for
1995). Both events led to a sudden change in the
amount of data available.
6 Conclusion
Word Embedding Networks (WEN) learns word
embeddings of individual data slices while simultaneously
aligning and ordering them in an unsupervised
manner.
After its parameters were selected on news articles from the New
York Times (1990-2002), the model could successfully
be applied to news articles from the
same corpus (2003-2016) and to data containing
German newspaper articles from 1947-2017 (Die
Zeit). Results on both data sets show a clear temporal
dynamic, as 85% and 67%, respectively, of the
closest time slices correspond to the neighboring
years. Time can thus be identified as the dominant
component governing change in word
meaning in both data sets.
However, it could be shown for both data sets
that change is not introduced at a constant pace;
there are times of slower and times of faster
change. We found that distributional changes
within the data set can have a huge influence on
the perceived pace of semantic change.
We therefore argue for caution when applying
models that assume continuous change, especially
to the widely used NYT data set.
For further research, we would like to expand
the experiments to corpora where the underlying
slices are not ordered. For example, given a corpus
of works grouped by author, we could train the
model to find proximities between authors based
on the similarity of meaning of the words they use.
Acknowledgments
This work was supported by the Federal Ministry
of Education and Research (BMBF) for the MALT
III project (01IS17058). We thank M. Alber for
valuable comments on the manuscript.
References
Mikhail Belkin and Partha Niyogi. 2002. Laplacian
eigenmaps and spectral techniques for embedding
and clustering. In Advances in Neural Information
Processing Systems, pages 585–591.
William L. Hamilton, Jure Leskovec, and Dan Jurafsky.
2016. Diachronic word embeddings reveal statistical
laws of semantic change. arXiv preprint
arXiv:1605.09096.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and
Tomas Mikolov. 2016. Bag of tricks for efficient text
classification. arXiv preprint arXiv:1607.01759.
Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde,
and Slav Petrov. 2014. Temporal analysis of language
through neural language models. arXiv
preprint arXiv:1405.3515.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and
Steven Skiena. 2015. Statistically significant detection
of linguistic change. In Proceedings of the
24th International Conference on World Wide Web,
pages 625–635. International World Wide Web Conferences
Steering Committee.
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013.
Exploiting similarities among languages for machine
translation. CoRR, abs/1309.4168.
Maja Rudolph and David Blei. 2018. Dynamic embeddings
for language evolution. In Proceedings of the
2018 World Wide Web Conference on World Wide
Web, pages 1003–1011. International World Wide
Web Conferences Steering Committee.
Maja Rudolph, Francisco Ruiz, Stephan Mandt, and
David Blei. 2016. Exponential family embeddings.
In Advances in Neural Information Processing Systems,
pages 478–486.
Terrence Szymanski. 2017. Temporal word analogies:
Identifying lexical replacement with diachronic
word embeddings. In Proceedings of the
55th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), volume 2,
pages 448–453.
Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and
Hui Xiong. 2018. Dynamic word embeddings for
evolving semantic discovery. In Proceedings of
the Eleventh ACM International Conference on Web
Search and Data Mining, pages 673–681. ACM.