Zipf’s law, Power-law distributions in “The Anatomy of Melancholy” – Part I

A while ago while trying to understand depression (a.k.a. why I was so fucking sad all the time), I came across a spectacular book which immediately caught my fancy. It was a 17th century book on (quite literally) The Anatomy of Melancholy written by a spectacular dude called Robert Burton. The Anatomy was first published in 1621 and was revised and republished five times by the author during his lifetime. As soon as I discovered it, I wanted to lay my hands on it and read it fully, but very soon I lost hope of that altogether.

The work itself is mind blowingly huge. To quote a reviewer in Goodreads :

And for you perverts, here is how the length of The Anatomy shakes out.

439 pages — Democritus (Burton’s persona) To The Reader and other front matter (125 pages) & First Partition.
261 pages — Second Partition
432 pages — Third Partition
Which amounts to 1132 pages. The remainder of its 1424 pages (292) consists of 6817 endnotes, (which are painlessly skippable), introductions, a glossary, and an index; unless you’ve got that ‘every damn page’ project in mind.

And mind you, the prose is difficult 17th century English. Critics have called it “The Book to End all Books”. Burton himself writes, “I write of melancholy by being busy to avoid melancholy.”

Though it claims to focus on melancholy, the Anatomy in fact delves into much, much more. “It uses melancholy as the lens through which all human emotion and thought may be scrutinized, and virtually the entire contents of a 17th-century library are marshalled into service of this goal. It is encyclopedic in its range and reference.”

Our good friends at Project Gutenberg have made the entire text of the Anatomy freely available to the public. You can access it here.

About the same time as this, I discovered Zipf’s law and its sister laws: Benford’s law and the Pareto distribution. (Terry Tao has a nice post describing all three and how they relate to each other.)

Zipf’s law is an empirical law which says that certain data sets can be approximated by the Zipfian distribution, which belongs to a family of more general discrete power-law probability distributions. More precisely, Zipf’s law states that if X is a discrete random variable, then the n^{th} largest value of X should be approximately C n^{-\alpha} for the first few n=1,2,3 \ldots and parameters C, \alpha >0. Of course, this does not hold for every discrete random variable X. A natural question is to ask which X follow Zipf’s law. As far as I know, apart from a few general comments about X, nothing further can be said regarding this question. Terry Tao says the above laws are seen to approximate the distribution of many statistics X which

  1. take values as positive numbers
  2. range over many different orders of magnitude
  3. arise from a complicated combination of many largely independent different factors
  4. have not been artificially rounded or truncated

Tao gives examples where, if hypotheses 1 and 4 are dropped, other laws rather than ones like Zipf’s law come into play.
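To make the formula C n^{-\alpha} a bit more concrete, here is a minimal Python sketch that just tabulates the ideal Zipfian values. The function name zipf_values and the parameters C = 1000 and \alpha = 1 are my own picks for illustration; they do not come from Tao’s post or from any real data set.

```python
def zipf_values(C, alpha, count):
    """Return the ideal values C * n**(-alpha) for n = 1, ..., count."""
    return [C * n ** (-alpha) for n in range(1, count + 1)]


# Hypothetical parameters, chosen only to make the numbers easy to read.
values = zipf_values(C=1000.0, alpha=1.0, count=5)
for n, value in enumerate(values, start=1):
    print(n, round(value, 1))
```

With \alpha = 1 the n^{th} value is simply C/n, so the product of rank and value stays constant. That is the “inverse relation” in its cleanest form.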

Zipf’s law posits an inverse relation between the ranks and frequencies, which is only common sense. So we look at a text, say the Anatomy, count how often each word occurs, and then rank the words by frequency, most frequent first. So one would expect the words “and”, “the” and “if” to have a really low rank (close to 1), and thus a really high frequency. One can see this clearly in the histogram below. The word “and” is ranked 1: it has the highest frequency of all the words.
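Before throwing a whole book at it, the ranking step can be tried on a toy sentence. This is only a sketch: the sentence is made up (it is not from the Anatomy), and the names ranked and table are mine.

```python
from collections import Counter

# A made-up sentence standing in for the book; not taken from the Anatomy.
text = "the cat and the dog and the bird"

# most_common() sorts by descending frequency, so position + 1 is the rank.
ranked = Counter(text.split()).most_common()
table = [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

for rank, word, freq in table:
    print(rank, word, freq)
```

The most frequent word gets rank 1, the next one rank 2, and so on. Words with equal counts get consecutive ranks in whatever order Counter returns them.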

Say now, that on a whim, (to test my insane python skillz) I write a program to fetch the entire text of the Anatomy and then with the text, I make a histogram of words and their frequencies. Here’s the program. (The whole book is available in plain text by the way. So I just had to wget the whole thing available here.)

"""Let us look at the word frequency in The Anatomy of Melancholy"""

from collections import Counter
from math import log
import numpy as np
import matplotlib.pyplot as plt

filename = "TAM.txt"

# Read the whole book and split it on whitespace into words.
with open(filename) as mel:
    contents = mel.read()
    words = contents.split()

# List of (word, frequency) tuples for the 100 most common words.
comm = Counter(words).most_common(100)

# Isolate the words and frequencies, and assign ranks 1, 2, 3, ... to the words.
labels = [word for word, _ in comm]
values = [freq for _, freq in comm]
ranks = list(range(1, len(labels) + 1))


indexes = np.arange(len(labels))
width = 0.2

# Bar chart of word frequencies
plt.title("Frequency of words in 'The Anatomy of Melancholy' ")
plt.xlabel("Ranked words")
plt.ylabel("Frequency")
plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels, rotation='vertical')
plt.show()

This then gives us the following graph.

[Figure: histogram of the 100 most frequent words in The Anatomy of Melancholy]

Now compare this with the picture below.

[Image: picture for comparison]

Doesn’t it make you want to scream, “Zipf!”?

Just to add even more weight to our hypothesis, let’s just plot the log of the frequencies against the log of the ranks of the words. If there is a power-law being followed here, we would expect the log-log graph of ranks and frequencies to be linear.

So put that into the code.

# Log-log graph of ranks and frequencies

logvals = [log(x) for x in values]
logranks = [log(x) for x in ranks]

plt.title("Log-Log graph of ranks and frequencies of words in 'The Anatomy of Melancholy' ")
plt.xlabel("logranks")
plt.ylabel("logvals")
plt.scatter(logranks,logvals)
plt.show()

This now gives us the following plot.

[Figure: log-log scatter plot of word ranks against frequencies]

Hmm. Seems like there is a quantity x_{\text{min}} after which the plot is almost linear. That is, it looks like the tail follows a power-law. Naturally, at this stage, we would want to do a least squares linear regression and fit a line to this plot, and if it’s a good fit, use that to conclude that we have a power-law!
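For the record, the naive fit described above would look roughly like the sketch below. It runs on synthetic data that follows an exact power law, with made-up values of C and \alpha, so it only shows the mechanics of fitting a line with np.polyfit on the log-log scale. It says nothing about whether the Anatomy’s word counts really follow a power law.

```python
import numpy as np

# Synthetic data following an exact power law; C and alpha are made up.
C, alpha = 5000.0, 1.2
ranks = np.arange(1, 101, dtype=float)
freqs = C * ranks ** (-alpha)

# Fit log(freq) = intercept + slope * log(rank) by ordinary least squares.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)

# For a power law, the slope is -alpha and the intercept is log(C).
alpha_hat = -slope
C_hat = np.exp(intercept)
print(alpha_hat, C_hat)
```

On perfect data the line recovers \alpha and C, because log(C n^{-\alpha}) = log C - \alpha log n is exactly linear in log n.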

Unfortunately, it’s not that simple. A lot of distributions give a straight-ish line in a log-log plot. So that is just not enough.

Also, as Cosma Rohilla Shalizi states in his blog, least squares linear regression on a log-log plot is a bad idea, because even if your data follows a power-law it gives a bad estimate of the parameters x_{\text{min}} and \alpha. Cosma even adds that even though this was what Pareto essentially did in 1890, “there is a time and place to be old school, and this is not it”.

We talked about x_{\text{min}} above. How do we get an estimate of that? How do we know where the power-law starts?

All fantastic questions! What about answers?

Well, up till 2009, no one knew how to answer them. Then there was a paper by Clauset, Shalizi and Newman which set all these matters to rest.

Indeed, to state whether a given data set is approximated satisfactorily by a power-law such as Zipf’s law is quite a tricky business, and it is this question which we shall be tackling later on. I hope to write another blog post after I’ve read through the paper and coded their recipe.

Till then, cheers!

The capacity to be alone – An obituary.

There is not a single day which passes in which I don’t see your name, or your influence and breathtaking power seeping through in the structures you created and called your own. Years ago I promised myself that I shall become like you, someone exactly like you…with superhuman prowess and might. Today with extreme sadness I realize that those dreams of mine were laughably childish. I shall never become half, nay, even a quarter of what you were.

You left the earth a year ago, leaving me irrevocably sad that I had not known you or spoken to you whilst you were alive. I had dreamt and prayed that you make an appearance in the future somehow alongside me and that I could just see you and maybe exchange a few pleasantries, as I wouldn’t have been capable of expressing in words my admiration for you.

I am not that kid I once was. Life has been cruel to me because it has all but robbed me of my chance of following your footsteps. But then again, I don’t know if all this was meant to be or if it was just me not working as hard as I should have.

I see the others every day. They surround me and talk around me and I am forced to listen. I am, as you were too, surprised by them sometimes, surprised by the facility with which they pick up, as if at play, new ideas, juggling them as if familiar with them from the cradle. But look where they are now, and look where you are. They pale in comparison. I ask myself if it will be the same for me? Of course it won’t.

During your later years you became dissatisfied with the system. You said you have retreated more and more from the scientific “milieu”. You said you noticed the outright theft perpetrated by colleagues of yours and that was why you declined the recognition being bestowed on you. This dissatisfaction which you had then has now made its way inside me. I am dissatisfied as well, but it is more of me being bitter because I have been rejected. How else should one respond to someone letting you know that you aren’t good enough?

What makes my heart ache is that I shall never again discover that beauty for myself. That single moment of clarity which reveals the structure behind mathematics in that synchronous harmony which is its own. You have experienced what I am talking about. I have too, but not nearly enough.

Now that it won’t be possible for me to experience it ever, what then, should my raison d’être be?

What I am most scared about, is that it is now that that bond between you and me shall begin to falter and eventually fade. You shall become just another famous name I know and there will be nothing in common between us.

The others have intimidated me all through my life. And whenever they have, your words have been the most powerful consolation I could have ever asked for. What then, will be my consolation when the bond between us breaks?

I miss you, Shurik. I miss you like a pupil misses his master. I miss you despite the fact that I have never seen you or heard your voice. I miss the joy you used to give me when I discovered I shared the same passion you had. I miss the fact that I won’t be able to call myself a mathematician anymore, in the fullest sense of the word: someone who “does” math, like someone “makes” love.

I have long contemplated learning French for the sole purpose of reading Récoltes et Semailles. I think now that I won’t. Reading it will be too painful for me and I have just about had enough disappointment to last me a lifetime.

Wherever you are, Alexander Grothendieck, rest in peace and know that you are missed.

Lain weather widget : “Service not available at the moment”.

On my Arch system I use copycat-killer’s awesome themes. One of my favorites is the “multicolor” theme because it gives me a ton of info and it looks snappy and nice overall. One thing I was most happy about was the weather widget that came along with the themes. Anyway, here’s a pic of it in action during happier times.

[Screenshot: the weather widget working]

But for the past few days, the weather indicator on top was annoyingly showing “N/A” and the weather widget was giving me a “Service not available at the moment.” message.

[Screenshot: the “Service not available at the moment.” message]

I googled around a bit and found out that the lain weather widget uses the Yahoo Weather API and that the devs at Yahoo decided that after 15th March, their weather API would give out data only to requests which were upgraded to OAuth 1, whatever the hell that is.

One of the answers here, however, suggested to simply replace “http://weather.yahooapis.com/” with “http://xml.weather.yahoo.com/” and I decided to do just that in the lain weather widget.

The init.lua file for the lain widgets which were part of the “yawn” library was at

/home/kody/.config/awesome/lain/widgets/yawn/init.lua

In that, I replaced the line

local api_url = 'http://weather.yahooapis.com/forecastrss'

with

local api_url = 'http://xml.weather.yahoo.com/forecastrss'

And then restarted awesome with Modkey+Ctrl+r, which resulted in my pretty looking working weather panel again. 🙂

UPDATE 12/04/16 : The weather widget stopped working again. Yahoo used to be cool! 😦

UPDATE 24/04/16 : After tinkering again, I realized that my current config files were not up to date with Copycat-killer’s github ones. He’s not using Yahoo Weather API anymore, but is using the OpenWeatherMap API. I used the new weather.lua widget of his and things are up and running once more.

 

How do I into GNU Privacy Guard?

I’ve asked myself the above question a ton of times and this guide here is the product of my struggles.

First of all, install the GPG package. If you are on Arch, you can use

# pacman -S gnupg

Also install rng-tools. This gives a random number generator daemon which helps to add entropy quicker when generating your keys. (I have read that feeding /dev/urandom back into the entropy pool like this adds no real randomness and can result in weaker keys, but I think it should be pretty okay.)

# pacman -S rng-tools

Start the random number generator daemon.

# rngd -r /dev/urandom

Then generate your keys.

gpg --gen-key

The process should be relatively painless. The defaults for everything should be good enough. Enter your name, comment and email-ID, secure your private key with a very strong passphrase. After enough entropy is created, you should be done. Now

 gpg --list-keys

Should show you your freshly generated key! Now your private key should be kept secret at all costs. Lock it away as if your life depended on it. If someone gets access to it, you’re screwed.

The next thing you need to do is to share your public key with people you want to communicate securely with. Most people just put their public keys on keyservers like the MIT keyserver and people who want a particular person’s public key can get their public key from the keyserver directly.

In order to put up your key on the MIT keyserver, create an ASCII-armored version of your public key :

gpg --export -a "KEYNAME" > public.key

Now copy and paste the contents of ‘public.key’ in the text box on the MIT keyserver page, it’ll take care of the rest.

Okay good. You have now created a public-private key pair and you have set up your public key for display. If you now find someone who you want to send an encrypted message to, first get ahold of their public key. Copy and paste their ASCII public key block into a file (I’m going to call it FRIENDS_PUB_KEY) and put it on your system. Next, you need to IMPORT that key of theirs.

gpg --import FRIENDS_PUB_KEY

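The first thing you need to do after creating your own key (or importing someone else’s public key) is to check the key’s fingerprint. Note that --edit-key takes the name or ID of a key in your keyring (use gpg --list-keys to find it), not the file you imported it from. I’m going to call your friend’s key FRIENDS_KEY_NAME.

 gpg --edit-key "FRIENDS_KEY_NAME"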

Now enter the following.

 gpg> fpr

This should output the key’s fingerprint.

This next step is very important. Once you have your friend’s key’s fingerprint, you need to verify that you and he have the SAME fingerprint (either over the phone, or snail mail or pigeon post). This ensures that the key has not been tampered with and that you will really be sending your encrypted messages to your friend and not to a man-in-the-middle. If your fingerprint and your friend’s fingerprint don’t match, then that means that someone has tampered with your friend’s public key and is probably waiting for you to send all of your messages to them instead of your friend.
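Comparing forty hex digits by eye is error-prone, since gpg prints them in groups and your friend may read them out in lowercase or without spaces. Here is a minimal Python sketch of a comparison that ignores those cosmetic differences. The function names and both fingerprints are made up for illustration; this is not part of GPG.

```python
def normalize_fingerprint(fpr):
    """Strip whitespace and colons and upper-case the hex digits."""
    return "".join(ch for ch in fpr if ch not in " \t\n:").upper()


def fingerprints_match(mine, theirs):
    """True only if both fingerprints are identical after normalization."""
    return normalize_fingerprint(mine) == normalize_fingerprint(theirs)


# Made-up 40-hex-digit fingerprints, not real keys.
shown_by_gpg = "0123 4567 89AB CDEF 0123  4567 89AB CDEF 0123 4567"
read_over_phone = "0123456789abcdef0123456789abcdef01234567"
print(fingerprints_match(shown_by_gpg, read_over_phone))
```

Every single digit has to agree. A match on just the first or last few digits is not good enough.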

Now if you want to encrypt a text file, a picture etc. this is how you do it.

gpg -e -u "YOUR_KEY_NAME" -r "RECIPIENT_KEY_NAME" somefile

This will create a file called ‘somefile.gpg’. You can now send this file over to your friend and be confident that he and only he will be able to decrypt its contents (unless of course you’re in deep shit).

In order to decrypt a file you have got from a friend, here is what you do.

gpg -d somefile.gpg

GPG will automatically search for the relevant secret key to do the decryption with. If it finds the key, your file will be successfully decrypted. If the secret key doesn’t exist, it’ll complain that it can’t find the secret key.

Geeky shit with SAGE, notify-send and XKCD!

I came across this xkcd comic a while back and have been practicing that every time I was bored and had nothing to do. (Pro tip : This kills a helluva lot of time while travelling!)

Now this semester I am taking a course on Algebraic Number Theory and some of the take home assignments required me to compute the class number of certain number fields. Instead of doing this the hard way, I decided to cheat and use SAGE, which is a computer algebra system for mathematicians to do fancy stuff. I haven’t explored it fully yet and I intend to do so during the summer. It’s really powerful because it integrates existing math software into it instead of “reinventing the wheel”.

One day when I was bored I decided to write a simple bash script to factor the time and display the result as a notification. The idea of the code was simple : get the time as a four digit number HHMM, pass that number to sage, ask it to factorize, store everything as a variable and push the factorization as a notification to the home screen.
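The factoring step on its own, without Sage or notifications, can be sketched in a few lines of Python. Note that this swaps Sage’s factor for plain trial division, which is more than enough for numbers below 2400. The function names factorize and format_factors are mine, and the output format only imitates how Sage prints a factorization.

```python
import time


def factorize(n):
    """Return the prime factorization of n as a list of (prime, exponent)."""
    factors = []
    p = 2
    while p * p <= n:
        exponent = 0
        while n % p == 0:
            n //= p
            exponent += 1
        if exponent:
            factors.append((p, exponent))
        p += 1
    if n > 1:
        # Whatever is left after dividing out small primes is itself prime.
        factors.append((n, 1))
    return factors


def format_factors(factors):
    """Format like '2^2 * 3 * 167', imitating how Sage prints factorizations."""
    return " * ".join(f"{p}^{e}" if e > 1 else str(p) for p, e in factors)


hhmm = int(time.strftime("%H%M"))  # e.g. 20:04 becomes 2004
# 0 and 1 have no prime factors, so fall back to printing the number itself.
print(hhmm, "=", format_factors(factorize(hhmm)) or str(hhmm))
```

For 20:04 this gives 2^2 * 3 * 167, since 2004 = 4 · 3 · 167.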

After a bit of searching, I found out about notify-send, which is a cool little package which helps in displaying notifications.

Putting all these together, I came up with the following script.

#! /bin/sh
#
# factor.sh
# Copyright (C) 2016 kody <kody@kodyarch>
#
# This short script computes the prime factorization of the time considered
# as a four digit number. For example, it looks at 20:04 as 2004 and computes
# its prime factorization.
#

TIME=$(date "+%H%M")
echo 'factor('$TIME')' | sage > ~/test1

FACTOR1=$(awk 'FNR==6 {print}' ~/test1 | cut -d ' ' -f2-)
FACTOR2=$(echo "$FACTOR1")

notify-send -t 10000 "The current time is $TIME." "And its factorization is $FACTOR2."

rm ~/test1

Here is a picture of it in action!

[Screenshot: the factorization notification]

To send a desktop notification from a background script, say via cron, running as root (replace X_user with the user running X):

 
# sudo -u X_user DISPLAY=:0 notify-send 'Hello world!' 'This is an example notification.'

 

O ye with silken hair.

A little something I just thought of writing on the spur of the moment while gazing hard at the hair of the girl sitting in front of me in a really really boring class.

O ye with silken hair,
Looking at you I despair.
For your locks, soft and meek
Seem like a mollified fractal, so to speak.

Just as in Mandelbrot’s set,
Your hair has a main bulb, where the bun has met
At a point, so near, so far and so light
Inaccessible but within sight.

From here, O maiden fair,
Emerge taut strands of hair
Like geodesics from infinity to and fro,
How perfect they go!

Your tresses, maiden fair, I recall
Seem like a tangent vector field
on the wedge sum of two spheres, big and small.

Brouwer was surely high,
when he proved the following lie:
“One cannot comb a hairy ball!”

O maiden fair, show Brouwer he is wrong,
His “proof” has stood for long,
Far too long!

As I write these lines, my conscience does prick,
People might whack me with a stick:
Brouwer’s theorem holds for a ball,
Not for the wedge sum of two spheres, big and small!

Alas, maiden fair, I was wrong.
But I’m not sad, for I have this song.
And now, I thank you, maiden fair,
For letting me write about your hair.

Fastest (and up to date) pacman mirrors.

We’ve all faced this problem sometime or the other : Installing/upgrading packages takes forever because the pacman mirrors are slow.
Thankfully, as always, the Arch wiki and pacman have us covered. Pacman itself comes with a bash script, located at

/usr/bin/rankmirrors

This script ranks mirrors according to their connection and opening speeds.
First make a backup of your existing mirrorlist.

# cp /etc/pacman.d/mirrorlist /etc/pacman.d/mirrorlist.backup

Edit the backup and uncomment all mirrors for testing.

# sed -i 's/^#Server/Server/' /etc/pacman.d/mirrorlist.backup
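If the sed one-liner looks cryptic, here is the same substitution sketched in Python on a tiny made-up mirrorlist. The URLs are placeholders, and the pattern is deliberately a bit more forgiving than the sed one: it accepts an optional space after the ‘#’, which is my own assumption about how such files may be formatted.

```python
import re

# A made-up mirrorlist fragment; the URLs are placeholders, not real mirrors.
mirrorlist = """\
## Example country
#Server = http://mirror.example.org/archlinux/$repo/os/$arch
# Server = http://mirror2.example.org/archlinux/$repo/os/$arch
"""

# Strip the leading '#' (and optional spaces or tabs) only from Server lines.
uncommented = re.sub(r"^#[ \t]*Server", "Server", mirrorlist, flags=re.MULTILINE)
print(uncommented)
```

Comment lines that do not introduce a Server entry, like the country header, are left alone.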

Finally rank the mirrors. The ‘6’ below outputs the 6 fastest mirrors.

# rankmirrors -n 6 /etc/pacman.d/mirrorlist.backup > /etc/pacman.d/mirrorlist

Note that it is good practice now to run the following.

# pacman -Syyu

This forces a redownload of the package lists and upgrades the packages. Passing two --refresh or -y flags will force a refresh of all package lists even if they appear to be up to date.