Exporting Nike+ run data to Strava and as a CSV file.

Oh man where do I begin?

I’ve just finished doing this, and it has left me exhausted and weary; I have no clue where or how to begin. But, like Seligman tells Joe, let me start at the very beginning.

I’ve been running more or less consistently for the past four years or so and have been using Nike+ Running to track my runs. At the start I was blown away by Nike+, enthusiastically recommending it to everyone I talked to and telling them to “add” me so that we could “compete”. I managed to convince about 12 people to join me, which resulted in quite a bit of healthy competition and even more Nike+ love. I also loved all the statistics and trophies Nike+ had to offer. I made sure to run on my birthday every year from 2012 to 2016 just because I wanted to earn the Extra Frosting badge. And there were Nike+ levels. Aah, the memories! I still vividly remember the run when I hit Green level. It was raining and I kept running till I couldn’t run anymore. I ran 7k at once. That was probably one of the best runs of my life.

As the years went on, I bought a Nike+ SportBand because I was finding it difficult to run with my phone. The SportBand works with a tiny shoe pod which goes into a slot inside a Nike+ shoe and connects to the SportBand while running, tracking your pace, distance, calories burnt and other things like that. I used it for about three years. Now, the shoe pod has an irreplaceable battery with, according to Nike, “1000 hours of battery life”.

Slowly, Nike phased out the SportBand and the shoe pod, and stopped making shoes with shoe pod slots in them. That got me paranoid. What if my shoe pod died suddenly? Then my SportBand would die too, because Nike isn’t selling standalone shoe pods anymore!

So I decided, with a heavy heart, to look at other run trackers out there in the market. There were tons of them! I heard a bunch of good reviews about Strava, so I decided to try that. All I had to do was transfer all of my runs onto Strava and poof, start anew!

Unfortunately, things weren’t that simple. Nike, it turned out, was maniacal about its data policies. Somehow they thought that MY run data was THEIR property, and did not let you export or download it. This was the first thing that pissed me off. But I didn’t lose my shit completely, because hey, there are worse companies out there.

It so happened that before I lost my shit completely, I had ordered a Nike+ SportWatch GPS to track my runs.

Later, Nike put up an idiotic new website which seemingly got rid of the Nike+ levels and the trophies too. This was the final straw. I lost my shit completely and went on a Twitter rant. But there wasn’t anything that could be done. So I went ahead and looked to Strava.

The first thing was to find a way to export my Nike+ run data to Strava. I searched a lot and found about three websites which promised to do that, but none of them worked. Finally I stumbled on a beautifully designed website which did the whole export in 4 simple steps. But it was too painful to do this every time I used my SportWatch for a run. So I looked into automating it, and even thought of writing my own script, but then I found a simpler app which does the same thing. I was saved!

Now, I also had this idea of downloading my run data and doing a statistical analysis on it to get a better understanding of my runs. This is what I did today. To do that, I first installed a Python package I found called “nikeplusapi”. Install it using pip2.7.
Next, since I wanted to write a Bash script to download the data, I needed a tool which parses JSON data. jshon was the answer to my problems.

Finally, here is the bash code which gets this shit done.

#!/bin/bash

# Log in to Nike and store the JSON response in test.json.
curl -k 'https://developer.nike.com/services/login' --data-urlencode username='EMAIL_ID_HERE' --data-urlencode password='PASSWORD_HERE' > test.json

# (Optional) pretty-print test.json to check that the login worked.
jshon < test.json

# Take the access token out of the JSON data.
ACCESS_TOKEN=$(jshon -e access_token < test.json | tr -d '"')

# Get your latest run data from the Nike developer website and store it in a file.
nikeplusapi -t "$ACCESS_TOKEN" > output

# Store the relevant line (the latest run) in a variable.
NEW_DATA=$(awk 'NR==2' output)

# Append the latest run to the old dataset containing all runs.
echo "$NEW_DATA" >> /home/kody/nikerundata.csv

# Clean up.
rm test.json output

Also, I modified the nikeplusapi code to output exactly the last workout’s data and nothing else. That is what gets appended to the existing CSV file in the Bash script above. The final data now lives in nikerundata.csv, and we can do our magic on it in R!

This Bash script is messy and gives out a bunch of errors on execution, but hey man, it works for now. That’s all I need.


Analysis of Doppler Ultrasound in Predicting Malignancy.

A while back I happened to come across Doppler ultrasound data of patients at a hospital. The data consisted of technical parameters related to each ultrasound and, finally, a “final diagnosis” of the patient, which could be either “Malignant” or “Benign”. The doctor who provided the data asked if I could see any trend in the technical parameters for predicting the final diagnosis.

I decided to have a go at it since it would be a good statistics refresher and some practice in R.

I found a bunch of interesting things in the data and, at the risk of tiring myself by explaining it all twice, I’m just going to point to the GitHub repository of this project. All the details are in the PDF file in that repository.

Zipf’s law, Power-law distributions in “The Anatomy of Melancholy” – Part II

The last post ended with me discovering a Zipf-like curve in the rank-frequency histogram of words in the Anatomy. The real problem was now to verify if the distribution was indeed explained by Zipf’s law. In the last post we saw that Zipf’s law was a special case of a more general family of distributions called “power law” distributions.

A discrete random variable X is said to follow a power law if its density looks like p(x) = P(X = x) = C x^{-\alpha}, where \alpha > 0 and C is a normalizing constant. We assume X is nonnegative integer valued. Clearly, the density diverges at x = 0, so that equation cannot hold for all x \geq 0; hence, there must be a quantity x_{\text{min}} > 0 above which the power law behaviour holds.

One can easily check that the value of C is given by \frac{1}{\zeta(\alpha,x_{\text{min}})}, where \zeta(\alpha, x_{\text{min}}) = \sum_{n=0}^{\infty}(n+x_{\text{min}})^{-\alpha} is the generalized (Hurwitz) zeta function. So the parameters of a power law are \alpha and x_{\text{min}}, and if we suspect that our data come from a power law, we first need to estimate these two quantities.
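As a quick sanity check of that constant, here is a minimal sketch in Python (it assumes SciPy is installed; \alpha = 2.5 and x_{\text{min}} = 9 are just illustrative values, not estimates):

from scipy.special import zeta

alpha, x_min = 2.5, 9
C = 1.0 / zeta(alpha, x_min)  # zeta(a, q) is the Hurwitz zeta: sum over n >= 0 of (n + q)^(-a)

def pmf(x):
    """ Power-law mass P(X = x) = C * x^(-alpha) for integer x >= x_min. """
    return C * x ** (-alpha)

# The masses over x >= x_min should sum to (approximately) 1.
print(sum(pmf(x) for x in range(x_min, 200000)))  # prints something very close to 1.0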

So, upon searching for ways to confirm whether the distribution was indeed coming from a power law, I chanced upon a paper by Clauset, Shalizi and Newman (2009) which outlines an explicit recipe for the verification process:

  1. Estimate the parameters x_{\text{min}} and \alpha of the power-law model.
  2. Calculate the goodness-of-fit between the data and the power law. If the resulting p-value is greater than 0.1 the power law is a plausible hypothesis for the data, otherwise it is rejected.
  3. Compare the power law with alternative hypotheses via a loglikelihood ratio test. For each alternative, if the calculated loglikelihood ratio is significantly different from zero, then its sign indicates whether the alternative is favored over the power-law model or not.

Their paper elaborates on each of the above steps, specifically on how to carry them out. Then they consider about 20 data sets and carry out this recipe on each of them.

Briefly, the main steps are as follows.

They estimate \alpha by its maximum likelihood estimator in the continuous case, and by an approximation in the discrete case, as there is no closed-form formula there. Next, x_{\text{min}} is estimated by creating a power law fit starting from each unique value in the dataset, then selecting the one that results in the minimal Kolmogorov-Smirnov distance D between the data and the fit.
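For reference, the continuous-case MLE from their paper is \hat{\alpha} = 1 + n\left[\sum_{i=1}^{n}\ln\frac{x_i}{x_{\text{min}}}\right]^{-1}, computed over the n observations x_i \geq x_{\text{min}}; the discrete approximation simply replaces x_{\text{min}} inside the logarithm by x_{\text{min}} - \frac{1}{2}.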

Now, given the observed data and the estimated parameters from the previous step, we can come up with a hypothesized power law distribution and say that the observed data come from it. But we need to check the goodness-of-fit. So we fit the power law using the estimated parameters and calculate the Kolmogorov-Smirnov statistic for this fit. Next, we generate a large number of power-law distributed synthetic data sets with scaling parameter \alpha and lower bound x_{\text{min}} equal to those of the distribution that best fits the observed data. We fit each synthetic data set individually to its own power-law model and calculate the Kolmogorov-Smirnov statistic for each one relative to its own model. Then we simply count what fraction of the time the resulting statistic is larger than the value for the empirical data. This fraction is the p-value, and we check whether it is greater than 0.1. The specifics of how this is carried out are given in the paper.
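Here is a minimal sketch of that bootstrap in Python, for the continuous case only (the discrete case needs the zeta-based CDF); it also refits only \alpha on each synthetic set, whereas the full recipe refits x_{\text{min}} too, so treat it as an illustration rather than the real procedure:

import numpy as np

def ks_stat(data, alpha, x_min):
    """ KS distance between the tail of the data and the fitted power-law CDF. """
    tail = np.sort(data[data >= x_min])
    cdf_fit = 1.0 - (tail / x_min) ** (1.0 - alpha)    # continuous power-law CDF
    cdf_emp = np.arange(1, len(tail) + 1) / len(tail)  # empirical CDF
    return np.max(np.abs(cdf_emp - cdf_fit))

def bootstrap_pvalue(data, alpha, x_min, n_sets=1000, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    d_obs = ks_stat(data, alpha, x_min)
    body = data[data < x_min]        # observed values below x_min
    p_tail = np.mean(data >= x_min)  # chance a synthetic point is a tail point
    n, exceed = len(data), 0
    for _ in range(n_sets):
        # Each synthetic point comes from the fitted power law with probability
        # p_tail, and is otherwise resampled from the observed body.
        from_tail = rng.random(n) < p_tail
        n_t = int(from_tail.sum())
        synth = np.empty(n)
        synth[from_tail] = x_min * (1 - rng.random(n_t)) ** (-1 / (alpha - 1))
        synth[~from_tail] = rng.choice(body, size=n - n_t)
        # Refit alpha on the synthetic tail (refitting x_min is skipped here).
        tail = synth[synth >= x_min]
        alpha_hat = 1 + len(tail) / np.sum(np.log(tail / x_min))
        exceed += ks_stat(synth, alpha_hat, x_min) > d_obs
    return exceed / n_sets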

To make sure that the fitted power law explains the data better than some other candidate distribution, say a lognormal or an exponential, we then conduct a loglikelihood ratio test for each alternative: if the calculated loglikelihood ratio is significantly different from zero, its sign indicates whether the alternative is favoured over the power-law model or not.

Thankfully, some great souls have coded the above steps into a Python library called powerlaw. So all I had to do was download and install the powerlaw library (it was available in the Arch User Repository) and then code away!

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#
# Copyright © 2016 kody <kody@kodick>
#
# Distributed under terms of the MIT license.

""" Using the powerlaw package to do analysis of The Anatomy of Melancholy. """
""" We use the steps given in Clauset,Shalizi,Newman (2007) for the analysis."""

from collections import Counter
from math import log
import powerlaw
import numpy as np
import matplotlib.pyplot as plt

file = "TAM.txt"

with open(file) as mel:
    contents = mel.read()
    words = contents.split()

""" Gives a list of tuples of most common words with frequencies """
comm = Counter(words).most_common(20000)

""" Isolate the words and frequencies and also assign ranks to the words """
labels = [i[0] for i in comm]
values = [i[1] for i in comm]
ranks = list(range(1, len(labels) + 1))  # most_common already sorts by frequency

""" Step 1 : Estimate x_min and alpha """
fit = powerlaw.Fit(values, discrete=True)
alpha = fit.alpha
x_min = fit.xmin
print("\nxmin is:", x_min)
print("Scaling parameter is:", alpha)

""" Step 1.5 : Visualization by plotting PDF, CDF and CCDF """
fig = fit.plot_pdf(color='b',original_data=True,linewidth=1.2)
fit.power_law.plot_pdf(color='b',linestyle='--',ax=fig)
fit.plot_ccdf(color='r', linewidth=1.2, ax=fig)
fit.power_law.plot_ccdf(color='r',linestyle='--',ax=fig)
plt.ylabel('PDF and CCDF')
plt.xlabel('Word Frequency')
plt.show()

""" Step 2&3 : Evaluating goodness of fit by this with candidate distribitions """
R1,p1 = fit.distribution_compare('power_law','stretched_exponential',normalized_ratio=True)
R2,p2 = fit.distribution_compare('power_law','exponential',normalized_ratio=True)
R3,p3 = fit.distribution_compare('power_law','lognormal_positive',normalized_ratio=True)
R4,p4 = fit.distribution_compare('power_law','lognormal',normalized_ratio=True)

print("Loglikelihood and p-value for stretched exponential: ",R1," ",p1,)
print("Loglikelihood and p-value for exponential: ",R2," ",p2,)
print("Loglikelihood and p-value for lognormal positive: ",R3," ",p3,)
print("Loglikelihood and p-value for lognormal: ",R4," ",p4,)

""" One notices that lognormal and power_law are very close in their fit for the data."""
fig1 = fit.plot_ccdf(linewidth=2.5)
fit.power_law.plot_ccdf(ax=fig1,color='r',linestyle='--')
fit.lognormal.plot_ccdf(ax=fig1,color='g',linestyle='--')
plt.xlabel('Word Frequency')
plt.ylabel('CCDFs of data, power law and lognormal.')
plt.title('Comparison of CCDFs of data and fitted power law and lognormal distributions.')
plt.show()

So here were the results.

The estimated scaling parameter was \widehat{\alpha} = 2.0467 and \widehat{x_{\text{min}}} = 9.

The loglikelihood ratio of the power law against the stretched exponential was 4.2944 and the p-value was 1.75 \times 10^{-5}. So we reject the stretched exponential.

The loglikelihood ratio of the power law against the exponential was 11.0326 and the p-value was 2.66 \times 10^{-28}. So we reject the exponential.

The loglikelihood ratio of the power law against the lognormal positive was 6.072 and the p-value was 1.26 \times 10^{-9}. So we reject the lognormal positive.

The loglikelihood ratio of the power law against the lognormal was 0.307 and the p-value was 0.75871.

To be honest, I didn’t know what to do with the last one. The loglikelihood ratio is positive, which would favour the power law over the lognormal, but only ever so slightly; and with a p-value of 0.76 the ratio is not significantly different from zero, so its sign can’t really be trusted either way.

So the questions now remain : should I be happy with power law or should I prefer lognormal? Also, is there a test which helps us decide between the power law and lognormal distributions?

As far as I know, these questions are still open. Anyway, I think I shall give it a rest here and maybe take this up later. All that is left now is the satisfaction that I have beaten melancholy by writing about The Anatomy of Melancholy. (Temporarily, at least!)