Zipf’s law, Power-law distributions in “The Anatomy of Melancholy” – Part I

A while ago while trying to understand depression (a.k.a. why I was so fucking sad all the time), I came across a spectacular book which immediately caught my fancy. It was a 17th century book on (quite literally) The Anatomy of Melancholy written by a spectacular dude called Robert Burton. The Anatomy was first published in 1621 and was revised and republished five times by the author during his lifetime. As soon as I discovered it, I wanted to lay my hands on it and read it fully, but very soon I lost hope of that altogether.

The work itself is mind-blowingly huge. To quote a reviewer on Goodreads:

And for you perverts, here is how the length of The Anatomy shakes out.

439 pages — Democritus (Burton’s persona) To The Reader and other front matter (125 pages) & First Partition.
261 pages — Second Partition
432 pages — Third Partition
Which amounts to 1132 pages. The remainder of its 1424 pages (292) consists of 6817 endnotes, (which are painlessly skippable), introductions, a glossary, and an index; unless you’ve got that ‘every damn page’ project in mind.

And mind you, the prose is difficult 17th-century English. Critics have called it "The Book to End all Books". Burton himself writes, "I write of melancholy by being busy to avoid melancholy."

Though it claims to focus on melancholy, the Anatomy in fact delves into much, much more. "It uses melancholy as the lens through which all human emotion and thought may be scrutinized, and virtually the entire contents of a 17th-century library are marshalled into service of this goal. It is encyclopedic in its range and reference."

Our good friends at Project Gutenberg have made the entire text of the Anatomy freely available to the public. You can access it here.

Around the same time, I discovered Zipf’s law and its sister laws: Benford’s law and the Pareto distribution. (Terry Tao has a nice post describing all three and how they relate to each other.)

Zipf’s law is an empirical law which says that certain data sets can be approximated by the Zipfian distribution, which belongs to a family of more general discrete power-law probability distributions. More precisely, Zipf’s law states that if X is a discrete random variable, then the n^{th} largest value of X should be approximately C n^{-\alpha} for the first few n=1,2,3 \ldots and parameters C, \alpha >0. Of course, this does not hold for every discrete random variable X. A natural question to ask is which X follow Zipf’s law. As far as I know, apart from a few general comments about X, nothing further can be said regarding this question. Terry Tao says the above laws are seen to approximate the distribution of many statistics X which

  1. take values as positive numbers
  2. range over many different orders of magnitude
  3. arise from a complicated combination of many largely independent different factors
  4. have not been artificially rounded or truncated

Tao gives examples where, if hypotheses 1 and 4 are dropped, laws other than Zipf-like ones come into play.
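To make the shape of the law concrete, here's a toy sketch (the constants C and \alpha below are arbitrary, picked purely for illustration) printing the frequencies that a Zipfian distribution with \alpha = 1 would predict for the first few ranks. Notice that each time the rank doubles, the predicted frequency roughly halves.

"""Toy illustration of the Zipfian form C * n**(-alpha)."""

# Arbitrary illustrative parameters, not fitted to anything.
C, alpha = 10000, 1.0

for n in range(1, 6):
    print("rank", n, "-> predicted frequency ~", round(C * n ** (-alpha)))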

Zipf’s law posits an inverse relation between ranks and frequencies, which is only common sense. So we look at a text, say the Anatomy, look at each word in it, note its frequency, and then assign a rank to each word based on its frequency. So one would expect words like "and", "the" and "if" to have a really low rank (rank 1, 2, 3, ...) and thus a really high frequency. One can see this clearly in the histogram below: the word "and" is ranked 1 and has the highest frequency of all the words.

Say now that, on a whim (to test my insane python skillz), I write a program to fetch the entire text of the Anatomy and then, from that text, make a histogram of words and their frequencies. Here's the program. (The whole book is available in plain text, by the way, so I just had to wget the whole thing, available here.)
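If you'd rather stay inside Python than shell out to wget, something along these lines would do. This is only a sketch: fetch_text and whatever URL you pass it are placeholders, so fill in the Gutenberg plain-text link yourself.

"""Sketch: download a plain-text file, as a stand-in for the wget step."""

from urllib.request import urlretrieve

def fetch_text(url, dest="TAM.txt"):
    """Save the plain-text file at `url` to `dest` and return the local path."""
    urlretrieve(url, dest)
    return dest

# Usage: fetch_text("<the Project Gutenberg plain-text URL>")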

"""Let us look at the word frequency in The Anatomy of Melancholy"""

from collections import Counter
from math import log
import numpy as np
import matplotlib.pyplot as plt

file = "TAM.txt"

with open(file) as mel:
       contents = mel.read()
       words = contents.split()

"""gives a list of tuples of most common words with frequencies."""
comm = Counter(words).most_common(100)

""" Isolate the words and frequencies and also assign ranks to the words. """
labels = [i[0] for i in comm]
values = [i[1] for i in comm]
ranks = [labels.index(i)+1 for i in labels]


indexes = np.arange(len(labels))
width = 0.2

"""Histogram of word frequencies"""
plt.title("Frequency of words in 'The Anatomy of Melancholy' ")
plt.xlabel("Ranked words")
plt.ylabel("Frequency")
plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels, rotation='vertical')
plt.show()

This then gives us the following graph.

[Figure: histogram of word frequencies in The Anatomy of Melancholy]

Now compare this with the picture below.

[Figure: a typical Zipfian rank-frequency plot, for comparison]

Doesn’t it make you want to scream, “Zipf!”?

Just to add even more weight to our hypothesis, let's plot the log of the frequencies against the log of the ranks of the words. If there is a power law being followed here, we would expect the log-log graph of ranks and frequencies to be linear.
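Why linear? Taking logarithms of both sides of the Zipfian form f(n) \approx C n^{-\alpha} gives

\log f(n) \approx \log C - \alpha \log n,

which is the equation of a straight line in \log n, with slope -\alpha and intercept \log C. So a roughly linear log-log plot is at least consistent with a power law.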

So put that into the code.

""" Log-Log graph of ranks and frequencies """

logvals=[log(x) for x in values]
logranks=[log(x) for x in ranks]

plt.title("Log-Log graph of ranks and frequencies of words in 'The Anatomy of Melancholy' ")
plt.xlabel("logranks")
plt.ylabel("logvals")
plt.scatter(logranks,logvals)
plt.show()

This now gives us the following plot.

[Figure: log-log scatter plot of word ranks against frequencies]

Hmm. Seems like there is a quantity x_{\text{min}} after which the plot is almost linear. That is, it looks like the tail follows a power-law. Naturally, at this stage, we would want to do a least squares linear regression and fit a line to this plot, and if it’s a good fit, use that to conclude that we have a power-law!
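For the record, the naive fit is a one-liner with numpy. This is just a sketch reusing the logranks and logvals lists from the script above, and, as the next few paragraphs explain, the estimate it spits out shouldn't be trusted.

# Naive approach: ordinary least squares on the log-log points.
# polyfit of degree 1 returns (slope, intercept); the slope estimates -alpha.
slope, intercept = np.polyfit(logranks, logvals, 1)
print("naive estimate of alpha:", -slope)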

Unfortunately, it’s not that simple. A lot of distributions give a straight-ish line in a log-log plot. So that is just not enough.

Also, as Cosma Rohilla Shalizi points out on his blog, least squares linear regression on a log-log plot is a bad idea, because even if your data follow a power law, it gives bad estimates of the parameters x_{\text{min}} and \alpha. Shalizi adds that even though this is essentially what Pareto did in 1890, "there is a time and place to be old school, and this is not it".

We talked about x_{\text{min}} above. How do we get an estimate of that? How do we know where the power-law starts?

All fantastic questions! What about answers?

Well, up till 2009, no one knew how to answer them. Then there was a paper by Clauset, Shalizi and Newman which set all these matters to rest.
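To give a taste of what's in the paper: for the continuous case, with x_{\text{min}} known, the maximum-likelihood estimate of the exponent works out to

\hat{\alpha} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\text{min}}} \right]^{-1},

where the x_i are the n observations with x_i \ge x_{\text{min}}, and x_{\text{min}} itself is chosen by minimizing the Kolmogorov-Smirnov distance between the data and the fitted power law. But more on all of that in the next post.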

Indeed, to state whether a given data set is approximated satisfactorily by a power-law such as Zipf’s law is quite a tricky business, and it is this question which we shall be tackling later on. I hope to write another blog post after I’ve read through the paper and coded their recipe.

Till then, cheers!