Probability Plots, Normal Distributions 'in nature'

Generate 5000 normally distributed values with given mean and standard deviation and plot the histogram

In [1]:
from pylab import *
from scipy.stats import norm
def gen_norm(N,mu,sigma):
    return sigma*norm.ppf(random(N))+mu
In [2]:
values=gen_norm(5000,2,3)
In [3]:
#divide values into 20 equal sized bins between -6 and 11
#and draw the histogram
bins=linspace(-6,11,20)
h=histogram(values,bins)
midpoints=[(h[1][i]+h[1][i+1])/2 for i in range(19)]
stem(midpoints,h[0])
show()

Superimpose the normal density

In [4]:
## Superimpose the plot of the normal density with the given mean and standard deviation, adjusted to same height
x=linspace(-6,11,200)
y=exp(-(x-2)**2/18)*max(h[0])
plot(x,y,color='k')
stem(midpoints,h[0])
show()

Make a qq-plot of the simulated data

In [5]:
from scipy.stats import probplot
u=probplot(values,plot=plt)
plt.axhline(color='k')
plt.axvline(color='k')
plt.show()

In contrast, look at the qq-plot when we try something that isn't normal

In [6]:
# generate exponentially distributed data
expvalues=-log(random(5000))
u=probplot(expvalues,plot=plt)
plt.show()

Now work with real data -- plot histogram of human height data (9999 18-year-olds)

In [6]:
f=open('heightextract','r')
hlist=[]
for s in f:
    hlist.append(float(s))
    
bins=linspace(min(hlist),max(hlist),40)
u=histogram(hlist,bins)
v=[(u[1][i]+u[1][i+1])/2 for i in range(39)]
#plot height histogram with max height normalized
    #to 1
nor=max(u[0])
stem(v,u[0]/nor)
xlim(min(hlist),max(hlist))
show()

Estimate mean and variance from data

In [7]:
hlist=array(hlist)
mubar=sum(hlist)/len(hlist)
varbar=sum(hlist**2)/len(hlist)-mubar**2

Superimpose normal density

In [8]:
x=linspace(min(hlist),max(hlist),500)
y=exp(-(x-mubar)**2/(2*varbar))
plot(x,y,color='k')
stem(v,u[0]/nor)
xlim(min(hlist),max(hlist))
show()

Make a qq plot

In [9]:
u=probplot(hlist,plot=plt)
plt.axhline(y=55,color='k')
plt.axvline(color='k')
plt.show()