# Script to back up home directory.

My laptop is in its final stages. It might die any moment. And since I’ve spent almost three years moulding my Arch install on it to my liking, I spent many sleepless nights pondering what I’d do if it crashed and never woke up.

Earlier I wrote a blog post on how to transfer an existing Arch install onto a new laptop. In that post, I noted that, in essence, it is the home directory and the list of packages of an Arch install that uniquely determine it, so if I had a copy of these, in theory, I would be able to transfer my install onto any new laptop. The steps to do this were outlined in that post.

So the question that remained was how to automate this process. To achieve that, I wrote a simple bash script.

#!/bin/bash

echo "Backing up to HDD..."
# Use rsync to backup to external HDD.
sudo rsync --info=progress2 -aAXn --delete --exclude={/home/*/.thumbnails/*,/home/*/.cache/mozilla/*,/home/*/.local/share/Trash/*} /home/kody/ /run/media/kody/TOSHIBA\ EXT/ARCH-BACKUP-2017/
echo "Back up done."
# Make a list of packages and store them in files on the HDD.
echo "Making a list of AUR and Pacman Packages and storing it on the HDD..."
pacman -Qqe | grep -vx "$(pacman -Qqm)" > /run/media/kody/TOSHIBA\ EXT/ARCH-BACKUP-2017/Packages_$(date '+%Y-%m-%d')

pacman -Qqm > /run/media/kody/TOSHIBA\ EXT/ARCH-BACKUP-2017/Packages_AUR_$(date '+%Y-%m-%d')
echo "Done."

The ‘n’ option in the rsync flags makes it a dry run, so you can see if things are okay before going in for the kill; remove it to perform the actual backup.

# Tlön, Uqbar, Orbis Tertius, Ultrafinitism and Depression.

Throughout my life, or more precisely, ever since I’ve attained wisdom, there have been individuals (“Mutants”) who, through their sheer intellect and brilliance, have managed to influence, impress upon and shape my thoughts. This familiar story of a single individual seeking and marking down a list of men who have provided him inspiration is not new. For example, Alexander Grothendieck (one of the “mutants” in my list), listed his own set of mutants in his Notes pour la Clef des Songes. His list contains eighteen names. My list contains a modest five.

These “Mutants” are human beings who are ahead of their time, precursors of a coming “New Age”. They are distinguished by internal freedom, insight into the nature of humanity and by the depth of Platonic genius inherent in their work. Inspection of their lives reveals periods in which each was tortured by their own mind, as if the weight of their genius was unbearable to them. All of these men (possibly with the exception of Da Vinci), at some point in their lives, were struck by melancholy, and learning how they dealt with their melancholia is greatly enlightening.

I myself have the tendency to slip into depression often; and during one of these manic depressive episodes, I happened to recall a line of Borges’, where he talked about writing a poem as “working [his sadness] out of his system and making something out of his experience”. So, I decided that I should do something similar myself: work the sadness out of my system and squeeze out something positive from it. It was then that I played around with an idea in my head, just for fun and to see where it would take me.
The main idea is derived from Jorge Luis Borges’ short story “Tlön, Uqbar, Orbis Tertius”, wherein Borges describes a universe which has completely adopted Berkeleyan Idealism without a God. In a nutshell this means that while Berkeley posited that only minds and mental constructs exist and thus the world exists because it is the mental construct of a God, Borges describes a Berkeleyan universe without a God, so all that exists is only that which people imagine in that particular instant, and the world is a series of such instants. Borges then describes various features of this curious universe, including its grammar, literature and so on.

I liked to imagine this universe as an infinitely dark room with people having attached flashlights to their heads. If a person’s flashlight falls on something, it would mean that he is imagining that thing. Structures which are imagined by the people in the room are only as vivid as the amount of light falling on them, and are capable of being extinguished completely if there is no light around them. This is precisely what Borges describes at the very end of the text.

I was interested in the imaginary mathematics of such a universe. At first, I thought it would be sensible to say that the imaginary mathematics of such a universe would be equivalent to the Ultrafinitism of our own mathematics.

A few words about Ultrafinitism first. Ultrafinitism is a branch of Constructivism as a Philosophy of Mathematics. Constructivists believe that it is necessary to “find” or “construct” a mathematical object in order to prove its existence. So for example, $\pi$ exists because we have $\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$ On the other hand, something shown to exist by proof by contradiction is something which the constructivists don’t allow to be labelled “existing”, because you haven’t explicitly constructed it. Ultrafinitists take it a step further.
Ultrafinitists deny the existence of the set of naturals $\mathbb{N}$, because it can never be completed. Here is an account of a conversation (taken from Harvey M. Friedman’s “Philosophical Problems in Logic”) with a well known Ultrafinitist, Alexander Esenin-Volpin, who sketched a program to prove the consistency of Zermelo-Fraenkel Set Theory with the Axiom of Choice in Ultrafinite Mathematics.

I have seen some ultrafinitists go so far as to challenge the existence of $2^{100}$ as a natural number, in the sense of there being a series of “points” of that length. There is the obvious “draw the line” objection, asking where in $2^1, 2^2, 2^3, \ldots, 2^{100}$ do we stop having “Platonistic reality”? Here this … is totally innocent, in that it can be easily be replaced by 100 items (names) separated by commas. I raised just this objection with the (extreme) ultrafinitist Esenin-Volpin during a lecture of his. He asked me to be more specific. I then proceeded to start with $2^1$ and asked him whether this is “real” or something to that effect. He virtually immediately said yes. Then I asked about $2^2$, and he again said yes, but with a perceptible delay. Then $2^3$, and yes, but with more delay. This continued for a couple of more times, till it was obvious how he was handling this objection. Sure, he was prepared to always answer yes, but he was going to take $2^{100}$ times as long to answer yes to $2^{100}$ then he would to answering $2^1$. There is no way that I could get very far with this.

On the other hand, Intuitionistic Logic and Primitive Recursive Arithmetic are agreed to be foundations for Constructivism and Finitism respectively. The appropriate foundations for Ultrafinite mathematics are still an open question.

Now coming to mathematics in Tlön, Borges actually writes a couple of lines about the mathematics and geometry of Tlön. He writes that in Tlön, the “very act of counting, changes the number being counted”.
My initial hunch about the equivalence of Tlön arithmetic and Ultrafinite arithmetic was based on the fact that our minds can only “picture” small numbers. We surely cannot picture $2^{100}$ trees without being unsure of the number being pictured. To make this intuition rigorous, I can think of the following steps.

1. Pin down axioms for mathematics in Tlön: One can start this by looking at analogues of Peano Arithmetic in Tlön.
2. Pin down axioms for Ultrafinite mathematics: This may be a problem because in the preliminary reading that I have done, I have learnt that there are no formal foundations for Ultrafinitism and this is one of the main problems of this field. Failing that, maybe one can start with axioms for Constructivism (Intuitionistic Logic) or Finitism (Primitive Recursive Arithmetic).
3. Show that they are equivalent: Show that the axioms imply each other. A cool way of doing this would be through Coq, the proof assistant.

# Twitter bots and WhatsApp bots.

I made an automated Twitter bot which tweets out funny sentences from a corpus of tweets by my favorite twitter user, @AccioBae. You can find the github repo here. It implements a Markov chain on the corpus of tweets. The first implementation wasn’t that effective, but I modified the algorithm slightly to get slightly more coherent results. You can find the twitter account here. I named it “H. Bustos Domecq”.

A little while later, I wondered if I could do the same with WhatsApp as well. Upon a little searching I came across a github repository of a Python library which interacts with WhatsApp. This was called “yowsup”. I used its echo implementation demo for a while, to great results, albeit temporarily. The very next morning I found that WhatsApp had blocked that number. I requested an unblock by telling them that I had no malicious or spammy intent. They said no. Oh well.

# Exporting Nike+ run data to Strava and as a CSV file.

Oh man where do I begin?
I’ve just finished doing this and it has left me exhausted and weary and I have no clue where or how to begin. But like Seligman tells Joe, let me start at the very beginning.

I’ve been running more or less consistently over the past four years or so and have been using Nike+ Running to track my runs. At the start I was blown away by Nike+, recommending it to everyone I talked to quite enthusiastically and telling them to “add” me so that we can “compete”. I managed to convince about 12 people to join me. This resulted in quite a bit of healthy competition and more Nike+ love. Also, I loved all the statistics and trophies which Nike+ had to offer. I made sure to run on all 4 of my birthdays from 2012 to 2016 just because I wanted to earn the Extra Frosting badge. Also, there were Nike+ levels. Aah the memories! I still vividly remember that run when I hit Green level. It was raining and I kept running till I couldn’t run anymore. I ran 7k at once. That was probably one of the best runs of my life.

As the years went on, I bought a Nike+ SportBand because I was finding it difficult to run with my phone. The SportBand works with a tiny shoe pod which goes into a tiny slot inside a Nike+ shoe and gets connected to the SportBand while running. Then it tracks your pace, distance, calories burnt and other things like that. I used it for about three years. Now the shoe pod has an irreplaceable battery and has, according to Nike, “1000 hours of battery life”. Slowly, Nike decided to phase out the SportBand and the shoe pod and also they stopped making shoes with the shoe pod slots in them. That got me paranoid. What if my shoe pod dies suddenly? Then my SportBand would die too, because Nike isn’t selling standalone shoe pods anymore!

So I decided, with a heavy heart, to look at other run trackers out there in the market. There were tons of them! I heard a bunch of good reviews about Strava, so I decided to try that.
All I had to do was transfer all of my runs onto Strava and poof, start anew! Unfortunately, things weren’t that simple. Nike, it turned out, were maniacal about their data policies. Somehow they thought that MY run data were THEIR property, and did not allow you to export and download run data. This was the first thing that pissed me off. But I didn’t lose my shit completely, because hey, there are worse companies out there.

It so happened that before I lost my shit completely, I ordered a Nike+ SportWatch GPS to track my runs. Later, Nike put up this idiotic new website which seemingly got rid of Nike+ levels and the trophies too. This was the final straw. I lost my shit completely and went on a twitter rant. But there wasn’t anything that could be done. So I went ahead and looked to Strava.

The first thing was to find a way to export my Nike+ run data to Strava. I searched a lot and found about three websites which promised to do that but none of them worked. Finally I stumbled on this beautifully designed website which did the whole exporting in 4 simple steps. But it was too painful to do this every time after I used my SportWatch for a run. Then I searched if I could automate it, and I even thought of writing my own script for it. But I found a simpler app which does the same thing. So I was saved!

Now I also had this idea of sorts to download my run data and do a statistical analysis on it to get a better understanding of my runs. This was what I did today.

To do that, I first installed this python package I found called “nikeplusapi”. Install it using pip2.7. Next, since I wanted to write a BASH script to download the data, I wanted to get a tool which parses JSON data. jshon was the answer to my problems.

Finally, here is the bash code which gets this shit done.

#!/bin/bash
# Get the JSON data and store it in test.json.
curl -k 'https://developer.nike.com/services/login' --data-urlencode username='EMAIL_ID_HERE' --data-urlencode password='PASSWORD_HERE' > test.json
# Make jshon read the test.json data.
jshon < test.json
# Take out the Access Token from the json data.
ACCESS_TOKEN=$(jshon -e access_token < test.json | tr -d '"')
# Get your latest run data from the Nike Developer website and store it into a file.
nikeplusapi -t "$ACCESS_TOKEN" > output
# Store the relevant data (the second line of the output) into a variable.
NEW_DATA=$(awk 'NR==2' output)

# Push the latest run data into the old dataset containing all runs.
echo "$NEW_DATA" >> /home/kody/nikerundata.csv

# Clean up.
rm test.json
rm output


Also, I modified the nikeplusapi code to display exactly the last workout’s data and nothing else. That is what I add to the existing CSV file in the Bash script above. The final data is now stored in nikerundata.csv and now we can do our magic on it in R!

This Bash script is messy and gives out a bunch of errors on execution, but hey man, it works for now. That’s all I need.
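For comparison, the two data-handling steps of the script above (pulling the access token out of the login response and appending one row to the CSV) could be sketched in Python with only the standard library. This is a sketch under assumptions, not a tested replacement: it assumes the login response is a JSON object with an `access_token` field, as the jshon call implies, and that the second line of the nikeplusapi output is the latest run. The function names are made up for illustration.

```python
import json


def extract_access_token(login_response: str) -> str:
    """Return the access_token field of a JSON login response (assumed layout)."""
    return json.loads(login_response)["access_token"]


def latest_run_row(api_output: str) -> str:
    """Return the second line of the output, mirroring awk 'NR==2'."""
    lines = api_output.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least two lines of output")
    return lines[1]


def append_run(csv_path: str, row: str) -> None:
    """Append one run to the CSV file that holds all earlier runs."""
    with open(csv_path, "a", encoding="utf-8") as handle:
        handle.write(row + "\n")
```

Parsing the JSON properly also removes the need for the `tr -d '"'` step, since the token comes back as a plain string.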

# Analysis of Doppler Ultrasound in Predicting Malignancy.

A while back I happened to come across data from a hospital which consisted of Doppler ultrasound data of patients at the hospital. The data consisted of technical parameters related to the ultrasound and finally, a “final diagnosis” of the patient, which could be either “Malignant” or “Benign”. The doctor who provided the data asked if I could see any trend in the technical parameters in predicting the final diagnosis.

I decided to have a go at it since it would be a good statistics refresher and some practice in R.

I found a bunch of interesting observations in the data and at the risk of tiring myself by explaining it all twice, I’m just going to point to the github repository of this project. All the details are in the pdf file in that repository.

# Zipf’s law, Power-law distributions in “The Anatomy of Melancholy” – Part II

The last post ended with me discovering a Zipf-like curve in the rank-frequency histogram of words in the Anatomy. The real problem was now to verify if the distribution was indeed explained by Zipf’s law. In the last post we saw that Zipf’s law was a special case of a more general family of distributions called “power law” distributions.

A discrete random variable $X$ is said to follow a power law if its density looks like $p(x) = P(X = x) = C x^{-\alpha}$, where $\alpha > 0$ and $C$ is a normalizing constant. We assume $X$ is nonnegative integer valued. Clearly, for $x = 0$ the density diverges, so that equation cannot hold for all $x \geq 0$; hence there must be a quantity $x_{\text{min}}>0$ such that the power law behaviour holds only for $x \geq x_{\text{min}}$.

One can easily check that the value of $C$ is given by $\frac{1}{\zeta(\alpha,x_{\text{min}})}$ where $\zeta(\alpha, x_{\text{min}}) = \sum_{n=0}^{\infty}(n+x_{\text{min}})^{-\alpha}$ is the generalized (Hurwitz) zeta function; note that this sum converges only for $\alpha > 1$. So the parameters of a power law are $\alpha$ and $x_{\text{min}}$. If we suspect that our data comes from a power law, we first need to estimate the quantities $\alpha$ and $x_{\text{min}}$.
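As a quick numerical check of the normalizing constant, here is a minimal sketch in plain Python. It approximates the zeta sum by adding the first terms directly and estimating the remaining tail with an integral plus a half-term correction (the first two terms of the Euler-Maclaurin formula, which is my choice and not something taken from the paper). The function names and the cut-off of 1000 terms are illustrative assumptions.

```python
import math


def hurwitz_zeta(alpha: float, x_min: float, terms: int = 1000) -> float:
    """Approximate sum_{n>=0} (n + x_min)^(-alpha) for alpha > 1."""
    if alpha <= 1:
        raise ValueError("the sum only converges for alpha > 1")
    head = sum((n + x_min) ** -alpha for n in range(terms))
    # Tail from n = terms onwards: integral estimate plus half of the first
    # omitted term (first two terms of the Euler-Maclaurin formula).
    start = terms + x_min
    tail = start ** (1 - alpha) / (alpha - 1) + 0.5 * start ** -alpha
    return head + tail


def power_law_pmf(x: int, alpha: float, x_min: int) -> float:
    """P(X = x) = x^(-alpha) / zeta(alpha, x_min) for x >= x_min, else 0."""
    if x < x_min:
        return 0.0
    return x ** -alpha / hurwitz_zeta(alpha, x_min)
```

With `x_min = 1` and `alpha = 2` the sum reduces to the ordinary $\zeta(2) = \pi^2/6$, which makes for a handy test of the approximation.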

So upon searching for ways to confirm if the distribution was indeed coming from a power law, I chanced upon a paper by Clauset, Shalizi and Newman (2009) which outlines an explicit recipe to be followed for the verification process.

1. Estimate the parameters $x_{\text{min}}$ and $\alpha$ of the power-law model.
2. Calculate the goodness-of-fit between the data and the power law. If the resulting p-value is greater than 0.1 the power law is a plausible hypothesis for the data, otherwise it is rejected.
3. Compare the power law with alternative hypotheses via a loglikelihood ratio test. For each alternative, if the calculated loglikelihood ratio is significantly different from zero, then its sign indicates whether the alternative is favored over the power-law model or not.

Their paper elaborates on each of the above steps, specifically on how to carry them out. Then they consider about 20 data sets and carry out this recipe on each of them.

Quickly giving the main steps :

They estimate $\alpha$ by its Maximum Likelihood Estimator: there is a closed form in the continuous case, and an approximation is used in the discrete case, where no closed form formula exists. Next, $x_{\text{min}}$ is estimated by creating a power law fit starting from each unique value in the dataset, then selecting the one that results in the minimal Kolmogorov-Smirnov distance $D$ between the data and the fit.
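To make the discrete case concrete: the approximation given in the paper, as I understand it, is $\widehat{\alpha} \approx 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\text{min}} - 1/2} \right]^{-1}$, where the sum runs over the $n$ observations with $x_i \geq x_{\text{min}}$. A minimal sketch follows; the function name is made up, and $x_{\text{min}}$ is taken as given rather than selected by the Kolmogorov-Smirnov scan.

```python
import math


def approx_discrete_alpha(data, x_min):
    """Approximate MLE of alpha for discrete data, using the tail x >= x_min."""
    tail = [x for x in data if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    # Each term is positive because x >= x_min > x_min - 1/2.
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum
```

Observations below `x_min` are simply dropped, which is why the choice of `x_min` matters so much for the estimate.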

Now given the observed data and the estimated parameters from the previous step, we can come up with a hypothesized power law distribution and say that the observed data come from the hypothesized distribution. But we need to be sure of the goodness-of-fit. So, for this, we fit the power law using the estimated parameters and calculate the Kolmogorov-Smirnov statistic for this fit. Next, we generate a large number of power-law distributed synthetic data sets with scaling parameter $\alpha$ and lower bound $x_{\text{min}}$ equal to those of the distribution that best fits the observed data. We fit each synthetic data set individually to its own power-law model and calculate the Kolmogorov-Smirnov statistic for each one relative to its own model. Then we simply count what fraction of the time the resulting statistic is larger than the value for the empirical data. This fraction is the $p$-value. If this $p$-value is greater than 0.1, the power law is a plausible fit. The specifics of how this is carried out are given in the paper.
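The counting step at the end is simple enough to write down directly. A minimal sketch, assuming the Kolmogorov-Smirnov statistics for the observed data and for each synthetic data set have already been computed (generating and refitting the synthetic sets is the expensive part and is not shown); the names are illustrative.

```python
def bootstrap_p_value(d_observed, d_synthetic):
    """Fraction of synthetic KS statistics larger than the observed one."""
    if not d_synthetic:
        raise ValueError("need at least one synthetic statistic")
    larger = sum(1 for d in d_synthetic if d > d_observed)
    return larger / len(d_synthetic)


def is_plausible(p_value, threshold=0.1):
    """Rule of thumb from the recipe: keep the power law if p is above 0.1."""
    return p_value > threshold
```

A large $p$-value here is good news for the power law: it means the synthetic data sets, which really are power-law distributed, often fit worse than the observed data does.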

To make sure that the fitted power law explains the data better than another candidate distribution, say like lognormal or exponential, we then conduct a loglikelihood ratio test. For each alternative, if the calculated loglikelihood ratio is significantly different from zero, then its sign indicates whether the alternative is favored over the power-law model or not.
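A sketch of how such a ratio and its $p$-value can be computed from pointwise loglikelihoods, following my reading of the normalized test in the paper: the ratio is divided by its estimated standard deviation $\sigma\sqrt{n}$ and compared against a standard normal. The function name and the input format (two equal-length lists of per-observation loglikelihoods) are assumptions for illustration.

```python
import math


def loglikelihood_ratio_test(loglik_a, loglik_b):
    """Return (normalized ratio, two-sided p-value) for model A against model B.

    A positive ratio favours model A; the p-value says whether the sign
    can be trusted.
    """
    if len(loglik_a) != len(loglik_b) or not loglik_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    ratio = sum(diffs)
    mean = ratio / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    if variance == 0:
        raise ValueError("pointwise differences are all equal; test undefined")
    normalized = ratio / math.sqrt(n * variance)
    p_value = math.erfc(abs(normalized) / math.sqrt(2))
    return normalized, p_value
```

A small $p$-value means the sign of the ratio is unlikely to be a fluctuation around zero; a large one means neither model can be preferred on this evidence.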

Thankfully, some great souls have coded the above steps into a python library called the powerlaw library. So all I had to do was download and install the powerlaw library (it was available in the Arch User Repository) and then code away!

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
#

""" Using the powerlaw package to do analysis of The Anatomy of Melancholy. """
""" We use the steps given in Clauset,Shalizi,Newman (2007) for the analysis."""

from collections import Counter
from math import log
import powerlaw
import numpy as np
import matplotlib.pyplot as plt

file = "TAM.txt"

with open(file) as mel:
    contents = mel.read()
    words = contents.split()

""" Gives a list of tuples of most common words with frequencies """
comm = Counter(words).most_common(20000)

""" Isolate the words and frequencies and also assign ranks to the words """
labels = [i[0] for i in comm]
values = [i[1] for i in comm]
ranks = list(range(1, len(labels) + 1))

""" Step 1 : Estimate x_min and alpha """
fit= powerlaw.Fit(values, discrete=True)
alpha = fit.alpha
x_min = fit.xmin
print("\nxmin is: " ,x_min,)
print("Scaling parameter is: ",alpha,)

""" Step 1.5 : Visualization by plotting PDF, CDF and CCDF """
fig = fit.plot_pdf(color='b',original_data=True,linewidth=1.2)
fit.power_law.plot_pdf(color='b',linestyle='--',ax=fig)
fit.plot_ccdf(color='r', linewidth=1.2, ax=fig)
fit.power_law.plot_ccdf(color='r',linestyle='--',ax=fig)
plt.ylabel('PDF and CCDF')
plt.xlabel('Word Frequency')
plt.show()

""" Step 2&3 : Evaluating goodness of fit by comparing the power law with candidate distributions """
R1,p1 = fit.distribution_compare('power_law','stretched_exponential',normalized_ratio=True)
R2,p2 = fit.distribution_compare('power_law','exponential',normalized_ratio=True)
R3,p3 = fit.distribution_compare('power_law','lognormal_positive',normalized_ratio=True)
R4,p4 = fit.distribution_compare('power_law','lognormal',normalized_ratio=True)

print("Loglikelihood ratio and p-value for stretched exponential: ",R1," ",p1,)
print("Loglikelihood ratio and p-value for exponential: ",R2," ",p2,)
print("Loglikelihood ratio and p-value for lognormal positive: ",R3," ",p3,)
print("Loglikelihood ratio and p-value for lognormal: ",R4," ",p4,)

""" One notices that lognormal and power_law are very close in their fit for the data."""
fig1 = fit.plot_ccdf(linewidth=2.5)
fit.power_law.plot_ccdf(ax=fig1,color='r',linestyle='--')
fit.lognormal.plot_ccdf(ax=fig1,color='g',linestyle='--')
plt.xlabel('Word Frequency')
plt.ylabel('CCDFs of data, power law and lognormal.')
plt.title('Comparison of CCDFs of data and fitted power law and lognormal distributions.')
plt.show()


So here were the results.

The estimated scaling parameter was $\widehat{\alpha} = 2.0467$ and $\widehat{x_{\text{min}}}=9$

The loglikelihood ratio of powerlaw against stretched exponential was $4.2944$ and the $p$-value was $1.75 \times 10^{-5}$. So we reject stretched exponential.

The loglikelihood ratio of powerlaw against exponential was $11.0326$ and the $p$-value was $2.66 \times 10^{-28}$. So we reject exponential.

The loglikelihood ratio of powerlaw against lognormal positive was $6.072$ and the $p$-value was $1.26 \times 10^{-9}$. So we reject lognormal positive.

The loglikelihood ratio of powerlaw against lognormal was $0.307$ and the $p$-value was $0.75871$.

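To be honest, I didn’t know what to do with the last one. A positive loglikelihood ratio means that the powerlaw is favoured over lognormal, but only ever so slightly; and with a $p$-value of $0.75871$ the ratio is not significantly different from zero, so by step 3 of the recipe its sign cannot be trusted and the test does not separate the two distributions.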
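As a sanity check on these numbers: assuming the ratios printed by the script are the normalized ones (that is what `normalized_ratio=True` asks for) and that the $p$-value is the two-sided normal tail $\operatorname{erfc}(|R|/\sqrt{2})$, the reported $p$-values can be recomputed from the reported ratios. A small sketch; the helper name is mine.

```python
import math


def p_from_normalized_ratio(normalized_ratio):
    """Two-sided p-value for a normalized loglikelihood ratio."""
    return math.erfc(abs(normalized_ratio) / math.sqrt(2))


# Normalized ratios reported above, keyed by the alternative distribution.
reported = {
    "stretched exponential": 4.2944,
    "exponential": 11.0326,
    "lognormal positive": 6.072,
    "lognormal": 0.307,
}
for name, ratio in reported.items():
    print(name, p_from_normalized_ratio(ratio))
```

The recomputed values should land close to the reported $p$-values, up to the rounding of the printed ratios. For the lognormal case a ratio of $0.307$ is well within one standard deviation of zero, which is why its $p$-value is so large.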

So the questions now remain : should I be happy with power law or should I prefer lognormal? Also, is there a test which helps us decide between the power law and lognormal distributions?

As far as I know, these questions are still open. Anyway, I think I shall give it a rest here and maybe take this up later. All that is left now is satisfaction that I have beaten melancholy by writing about The Anatomy of Melancholy. (Temporarily at least!)

# Zipf’s law, Power-law distributions in “The Anatomy of Melancholy” – Part I

A while ago while trying to understand depression (a.k.a. why I was so fucking sad all the time), I came across a spectacular book which immediately caught my fancy. It was a 17th century book on (quite literally) The Anatomy of Melancholy written by a spectacular dude called Robert Burton. The Anatomy was first published in 1621 and was revised and republished five times by the author during his lifetime. As soon as I discovered it, I wanted to lay my hands on it and read it fully, but very soon I lost hope of that altogether.

The work itself is mind blowingly huge. To quote a reviewer in Goodreads :

And for you perverts, here is how the length of The Anatomy shakes out.

439 pages — Democritus (Burton’s persona) To The Reader and other front matter (125 pages) & First Partition.
261 pages — Second Partition
432 pages — Third Partition
Which amounts to 1132 pages. The remainder of its 1424 pages (292) consists of 6817 endnotes, (which are painlessly skippable), introductions, a glossary, and an index; unless you’ve got that ‘every damn page’ project in mind.

And mind you, the prose is difficult 17th century English. Critics have called it “The Book to End all Books”. Burton himself writes that, “I write of melancholy by being busy to avoid melancholy.”

Though itself stating that it focusses on melancholy, the Anatomy in fact delves into much much more. “It uses melancholy as the lens through which all human emotion and thought may be scrutinized, and virtually the entire contents of a 17th-century library are marshalled into service of this goal. It is encyclopedic in its range and reference.”

Our good friends at the Project Gutenberg have made the entire text of the Anatomy available for free to the public. You can access the entire text here.

About the same time as this, I discovered Zipf’s law and its sister laws: Benford’s law and the Pareto distribution. (Terry Tao has a nice post describing all three and how they relate to each other.)

Zipf’s law is an empirical law which says that certain data sets can be approximated by the Zipfian distribution, which is a part of a family of more general discrete power-law probability distributions. More precisely, Zipf’s law states that if $X$ is a discrete random variable, then the $n^{th}$ largest value of $X$ should be approximately $C n^{-\alpha}$ for the first few $n=1,2,3 \ldots$ and parameters $C, \alpha >0$. Of course, this does not hold for any discrete random variable $X$. A natural question is to ask which $X$ follows Zipf’s law. As far as I know, apart from a few general comments about $X$, nothing further can be said regarding this question. Terry Tao says the above laws are seen to approximate the distribution of many statistics $X$ which

1. take values as positive numbers
2. range over many different orders of magnitude
3. arise from a complicated combination of many largely independent different factors
4. have not been artificially rounded or truncated

Tao gives examples where, if hypotheses 1 and 4 are dropped, other laws rather than ones like Zipf’s law come into play.

Zipf’s law posits an inverse relation between the ranks and frequencies, which is only common sense. So we look at a text, say the Anatomy, look at each word in it, note its frequency and then assign a rank to each word based on its frequency. So one would expect the words “and”, “the” and “if” to have a really low rank, and thus a really high frequency. One can see this clearly in the histogram below. The word “and” is ranked 1 and it has the highest frequency among all words.

Say now, that on a whim, (to test my insane python skillz) I write a program to fetch the entire text of the Anatomy and then with the text, I make a histogram of words and their frequencies. Here’s the program. (The whole book is available in plain text by the way. So I just had to wget the whole thing available here.)

"""Let us look at the word frequency in The Anatomy of Melancholy"""

from collections import Counter
from math import log
import numpy as np
import matplotlib.pyplot as plt

file = "TAM.txt"

with open(file) as mel:
    contents = mel.read()
    words = contents.split()

"""gives a list of tuples of most common words with frequencies."""
comm = Counter(words).most_common(100)

""" Isolate the words and frequencies and also assign ranks to the words. """
labels = [i[0] for i in comm]
values = [i[1] for i in comm]
ranks = list(range(1, len(labels) + 1))

indexes = np.arange(len(labels))
width = 0.2

"""Histogram of word frequencies"""
plt.title("Frequency of words in 'The Anatomy of Melancholy' ")
plt.xlabel("Ranked words")
plt.ylabel("Frequency")
plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels, rotation='vertical')
plt.show()


This then gives us the following graph.

Now compare this with the picture below.

Doesn’t it make you want to scream, “Zipf!”?

Just to add even more weight to our hypothesis, let’s just plot the log of the frequencies against the log of the ranks of the words. If there is a power-law being followed here, we would expect the log-log graph of ranks and frequencies to be linear.

So put that into the code.

""" Log-Log graph of ranks and frequencies """

logvals=[log(x) for x in values]
logranks=[log(x) for x in ranks]

plt.title("Log-Log graph of ranks and frequencies of words in 'The Anatomy of Melancholy' ")
plt.xlabel("logranks")
plt.ylabel("logvals")
plt.scatter(logranks,logvals)
plt.show()


This now gives us the following plot.

Hmm. Seems like there is a quantity $x_{\text{min}}$ after which the plot is almost linear. That is, it looks like the tail follows a power-law. Naturally, at this stage, we would want to do a least squares linear regression and fit a line to this plot, and if it’s a good fit, use that to conclude that we have a power-law!

Unfortunately, it’s not that simple. A lot of distributions give a straight-ish line in a log-log plot. So that is just not enough.

Also, as Cosma Rohilla Shalizi states in his blog, least squares linear regression on a log-log plot is a bad idea, because even if your data follows a power-law it gives a bad estimate of the parameters $x_{\text{min}}$ and $\alpha$. Cosma even adds that even though this was what Pareto essentially did in 1890, “there is a time and place to be old school, and this is not it”.

We talked about $x_{\text{min}}$ above. How do we get an estimate of that? How do we know where the power-law starts?

Well, up till 2009, no one knew how to answer these questions. Then there was a paper by Clauset, Shalizi and Newman which set all these matters to rest.

Indeed, to state whether a given data set is approximated satisfactorily by a power-law such as Zipf’s law is quite a tricky business, and it is this question which we shall be tackling later on. I hope to write another blog post after I’ve read through the paper and coded their recipe.

Till then, cheers!