The last post ended with me discovering a Zipf-like curve in the rank-frequency histogram of words in the Anatomy. The problem now was to verify whether the distribution was indeed explained by Zipf’s law. In the last post we also saw that Zipf’s law is a special case of a more general family of distributions called “power law” distributions.
A discrete random variable $X$ is said to follow a power law if its density looks like $p(x) = C x^{-\alpha}$, where $\alpha > 1$ and $C$ is a normalizing constant. We assume $X$ is nonnegative integer valued. Clearly, for $x = 0$ the density diverges, so that equation cannot hold for all $x \geq 0$, and hence there must be a quantity $x_{\min} > 0$ such that the above power law behaviour is followed only for $x \geq x_{\min}$.
One can easily check that the value of $C$ is given by $C = 1/\zeta(\alpha, x_{\min})$, where $\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$ is the generalized (Hurwitz) zeta function. So the parameters of a power law are $\alpha$ and $x_{\min}$. If we suspect that our data come from a power law, we first need to estimate the quantities $\alpha$ and $x_{\min}$.
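As a quick sanity check, the normalizing constant can be computed numerically: SciPy’s `zeta(s, q)` implements exactly this Hurwitz zeta function. A minimal sketch, with illustrative (not fitted) values $\alpha = 2.5$ and $x_{\min} = 1$:

```python
# Sanity check: the normalizing constant C = 1 / zeta(alpha, x_min)
# makes the discrete power-law pmf sum to 1. Values are illustrative.
from scipy.special import zeta  # zeta(s, q) = sum_{n>=0} (n + q)**(-s)

alpha, x_min = 2.5, 1
C = 1.0 / zeta(alpha, x_min)

# p(x) = C * x**(-alpha) for x = x_min, x_min + 1, ...
total = sum(C * x ** (-alpha) for x in range(x_min, 100_000))
print(C, total)  # total should be very close to 1
```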
So, upon searching for ways to confirm whether the distribution was indeed coming from a power law, I chanced upon a paper by Clauset, Shalizi and Newman (2009) which outlines an explicit recipe to be followed for the verification process.
- Estimate the parameters $x_{\min}$ and $\alpha$ of the power-law model.
- Calculate the goodness-of-fit between the data and the power law. If the resulting p-value is greater than 0.1, the power law is a plausible hypothesis for the data; otherwise it is rejected.
- Compare the power law with alternative hypotheses via a loglikelihood ratio test. For each alternative, if the calculated loglikelihood ratio is significantly different from zero, then its sign indicates whether the alternative is favored over the power-law model or not.
Their paper elaborates on each of the above steps, specifically on how to carry them out. Then they consider about 20 data sets and carry out this recipe on each of them.
Quickly giving the main steps:

They estimate $\alpha$ by its maximum likelihood estimator, $\hat\alpha = 1 + n\left[\sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}}\right]^{-1}$, in the continuous case, and use an approximation in the discrete case, since there is no closed-form formula there. Next, $x_{\min}$ is estimated by creating a power law fit starting from each unique value in the dataset, then selecting the one that results in the minimal Kolmogorov–Smirnov distance between the data and the fit.
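To make the estimation step concrete, here is a rough, self-contained sketch (my own helper names, not the powerlaw library’s implementation): it uses the continuous-case MLE throughout and scans candidate $x_{\min}$ values for the smallest KS distance, skipping very small tails as too noisy.

```python
import math

def alpha_mle(data, x_min):
    # Continuous-case maximum likelihood estimate of the scaling
    # parameter, using only the tail x >= x_min.
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def ks_distance(data, x_min, alpha):
    # Kolmogorov-Smirnov distance between the empirical tail CDF and the
    # fitted (continuous) power-law CDF  F(x) = 1 - (x / x_min)**(1 - alpha).
    tail = sorted(x for x in data if x >= x_min)
    n = len(tail)
    return max(abs((i + 1) / n - (1.0 - (x / x_min) ** (1.0 - alpha)))
               for i, x in enumerate(tail))

def fit_power_law(data, min_tail=10):
    # Try each unique value as x_min; keep the (x_min, alpha, ks) triple
    # with the smallest KS distance. Tiny tails are skipped as too noisy.
    best = None
    for x_min in sorted(set(data)):
        if sum(1 for x in data if x >= x_min) < min_tail:
            continue
        a = alpha_mle(data, x_min)
        d = ks_distance(data, x_min, a)
        if best is None or d < best[2]:
            best = (x_min, a, d)
    return best
```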
Now, given the observed data and the estimated parameters from the previous step, we can come up with a hypothesized power law distribution and say that the observed data come from the hypothesized distribution. But we need to be sure of the goodness of fit. So, for this, we fit the power law using the estimated parameters and calculate the Kolmogorov–Smirnov statistic for this fit. Next, we generate a large number of power-law distributed synthetic data sets with scaling parameter $\alpha$ and lower bound $x_{\min}$ equal to those of the distribution that best fits the observed data. We fit each synthetic data set individually to its own power-law model and calculate the Kolmogorov–Smirnov statistic for each one relative to its own model. Then we simply count the fraction of the time the resulting statistic is larger than the value for the empirical data. This fraction is the p-value. We then check whether this p-value is greater than 0.1. The specifics of how this is carried out are given in the paper.
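The bootstrap above can be sketched in a few lines. This is a deliberately simplified version (my own helper names, defined inline): it treats $x_{\min}$ as fixed and refits only $\alpha$ on each synthetic set, and it works with the continuous model throughout, whereas the paper refits both parameters and handles the data below $x_{\min}$ semi-parametrically.

```python
import math
import random

def alpha_mle(data, x_min):
    # Continuous-case MLE of the scaling parameter on the tail x >= x_min.
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def ks_distance(data, x_min, alpha):
    # KS distance between the empirical tail CDF and the fitted power-law CDF.
    tail = sorted(x for x in data if x >= x_min)
    n = len(tail)
    return max(abs((i + 1) / n - (1.0 - (x / x_min) ** (1.0 - alpha)))
               for i, x in enumerate(tail))

def power_law_sample(n, x_min, alpha, rng):
    # Inverse-transform sampling from the continuous power law.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def bootstrap_p_value(data, x_min, alpha, n_sets=200, seed=0):
    # Fraction of synthetic power-law data sets whose KS distance (to their
    # own refitted model) exceeds the KS distance of the observed data.
    rng = random.Random(seed)
    d_obs = ks_distance(data, x_min, alpha)
    n = sum(1 for x in data if x >= x_min)
    larger = 0
    for _ in range(n_sets):
        synth = power_law_sample(n, x_min, alpha, rng)
        a = alpha_mle(synth, x_min)  # refit each synthetic set
        if ks_distance(synth, x_min, a) >= d_obs:
            larger += 1
    return larger / n_sets
```

With data actually drawn from a power law the p-value tends to be comfortably spread over $[0, 1]$; for a badly fitting model it collapses toward zero.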
To make sure that the fitted power law explains the data better than other candidate distributions, such as the lognormal or the exponential, we then conduct a loglikelihood ratio test. For each alternative, if the calculated loglikelihood ratio is significantly different from zero, then its sign indicates whether the alternative is favored over the power-law model or not.
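For intuition, the normalized (Vuong-style) loglikelihood ratio for two already-fitted tail models can be computed directly. Here is a sketch for power law versus exponential, using continuous models and a helper name of my own; a positive normalized ratio favors the power law, and the p-value comes from a normal approximation.

```python
import math

def normalized_llr(data, x_min, alpha, lam):
    # Pointwise log-likelihoods of the tail under the two models:
    #   power law:    p(x) = ((alpha - 1) / x_min) * (x / x_min)**(-alpha)
    #   exponential:  p(x) = lam * exp(-lam * (x - x_min))
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    ll_pl = [math.log((alpha - 1) / x_min) - alpha * math.log(x / x_min)
             for x in tail]
    ll_ex = [math.log(lam) - lam * (x - x_min) for x in tail]
    diffs = [a - b for a, b in zip(ll_pl, ll_ex)]
    R = sum(diffs)
    mean = R / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    R_norm = R / (sigma * math.sqrt(n))
    p = math.erfc(abs(R_norm) / math.sqrt(2))  # two-sided normal p-value
    return R_norm, p
```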
Thankfully, some great souls have coded the above steps into a Python library called powerlaw. So all I had to do was download and install the powerlaw library (it was available in the Arch User Repository) and then code away!
```python
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
# Copyright © 2016 kody <kody@kodick>
#
# Distributed under terms of the MIT license.

"""Using the powerlaw package to do analysis of The Anatomy of Melancholy.

We use the steps given in Clauset, Shalizi, Newman (2009) for the analysis.
"""

from collections import Counter

import matplotlib.pyplot as plt
import powerlaw

file = "TAM.txt"
with open(file) as mel:
    contents = mel.read()

words = contents.split()

# List of (word, frequency) tuples for the most common words
comm = Counter(words).most_common(20000)

# Isolate the words and frequencies, and assign ranks to the words
labels = [i[0] for i in comm]
values = [i[1] for i in comm]
ranks = [labels.index(i) + 1 for i in labels]

# Step 1: Estimate x_min and alpha
fit = powerlaw.Fit(values, discrete=True)
alpha = fit.alpha
x_min = fit.xmin
print("\nxmin is:", x_min)
print("Scaling parameter is:", alpha)

# Step 1.5: Visualization by plotting PDF, CDF and CCDF
fig = fit.plot_pdf(color='b', original_data=True, linewidth=1.2)
fit.power_law.plot_pdf(color='b', linestyle='--', ax=fig)
fit.plot_ccdf(color='r', linewidth=1.2, ax=fig)
fit.power_law.plot_ccdf(color='r', linestyle='--', ax=fig)
plt.ylabel('PDF and CCDF')
plt.xlabel('Word Frequency')
plt.show()

# Steps 2 & 3: Evaluate goodness of fit by comparing with candidate distributions
R1, p1 = fit.distribution_compare('power_law', 'stretched_exponential', normalized_ratio=True)
R2, p2 = fit.distribution_compare('power_law', 'exponential', normalized_ratio=True)
R3, p3 = fit.distribution_compare('power_law', 'lognormal_positive', normalized_ratio=True)
R4, p4 = fit.distribution_compare('power_law', 'lognormal', normalized_ratio=True)
print("Loglikelihood ratio and p-value for stretched exponential:", R1, p1)
print("Loglikelihood ratio and p-value for exponential:", R2, p2)
print("Loglikelihood ratio and p-value for lognormal positive:", R3, p3)
print("Loglikelihood ratio and p-value for lognormal:", R4, p4)

# One notices that lognormal and power_law are very close in their fit to the data.
fig1 = fit.plot_ccdf(linewidth=2.5)
fit.power_law.plot_ccdf(ax=fig1, color='r', linestyle='--')
fit.lognormal.plot_ccdf(ax=fig1, color='g', linestyle='--')
plt.xlabel('Word Frequency')
plt.ylabel('CCDFs of data, power law and lognormal')
plt.title('Comparison of CCDFs of data and fitted power law and lognormal distributions')
plt.show()
```
So here were the results.
The estimated scaling parameter was $\alpha = \ldots$ and the estimated lower bound was $x_{\min} = \ldots$.

The loglikelihood ratio of the power law against the stretched exponential was $\ldots$ and the p-value was $\ldots$. So we reject the stretched exponential.

The loglikelihood ratio of the power law against the exponential was $\ldots$ and the p-value was $\ldots$. So we reject the exponential.

The loglikelihood ratio of the power law against the lognormal positive was $\ldots$ and the p-value was $\ldots$. So we reject the lognormal positive.

The loglikelihood ratio of the power law against the lognormal was $\ldots$ and the p-value was $\ldots$.
To be honest, I didn’t know what to do with the last one. Since the loglikelihood ratio was positive, the power law is favoured over the lognormal, but only ever so slightly.
So the questions now remain: should I be happy with the power law, or should I prefer the lognormal? Also, is there a test that helps us decide between the power law and lognormal distributions?
As far as I know, these questions are still open. Anyway, I think I shall give it a rest here and maybe take this up later. All that is left now is the satisfaction that I have beaten melancholy by writing about The Anatomy of Melancholy. (Temporarily, at least!)