#!/usr/bin/env python
# coding: utf-8
# In[1]:
from IPython.display import Image
Image('../../../python_for_probability_statistics_and_machine_learning.jpg')
# [Python for Probability, Statistics, and Machine Learning](https://www.springer.com/fr/book/9783319307152)
# In[2]:
from __future__ import division
import numpy as np
np.random.seed(1234)
# In[3]:
get_ipython().run_line_magic('pylab', 'inline')
# The estimation problem starts with the desire to infer something meaningful
# from data. For parametric estimation, the strategy is to postulate a model for
# the data and then use the data to fit model parameters. This leads to two
# fundamental questions: where to get the model and how to estimate the
# parameters? The first question is best answered by the maxim: *all models are
# wrong, some are useful*. In other words, choosing a model depends as much on
# the application as on the model itself. Think about models as building
# different telescopes to view the sky. No one would ever claim that the
# telescope generates the sky! It is the same with data models. Models give us
# multiple perspectives on the data that themselves are proxies for some deeper
# underlying phenomenon.
#
# Some categories of data may be more commonly studied using certain types of
# models, but this is usually very domain-specific and ultimately depends on the
# aims of the analysis. In some cases, there may be strong physical reasons
# behind choosing a model. For example, one could postulate that the model is
# linear with some noise as in the following:
# $$
# Y = a X + \epsilon
# $$
# which basically says that you, as the experimenter, dial in some
# value for $X$ and then read off something directly proportional to $X$ as the
# measurement, $Y$, plus some additive noise that you attribute to jitter in the
# apparatus. Then, the next step is to estimate the parameter $a$ in the model,
# given some postulated claim about the nature of $\epsilon$. How to compute the
# model parameters depends on the particular methodology. The two broad rubrics
# are parametric and non-parametric estimation. In the former, we assume we know
# the density function of the data and then try to derive the embedded parameters
# for it. In the latter, we claim only to know that the density function is a
# member of a broad class of density functions and then use the data
# to characterize a member of that class. Broadly speaking, the former consumes
# less data than the latter, because there are fewer unknowns to compute from
# the data.
#
# Let's concentrate on parametric estimation for now. The tradition is to denote
# the unknown parameter to be estimated as $\theta$ which is a member of a large
# space of alternates, $\Theta$. To judge between potential $\theta$ values, we
# need an objective function, known as a *risk* function,
# $L(\theta,\hat{\theta})$, where $\hat{\theta}(\mathbf{x})$ is an
# estimate for the unknown $\theta$ that is derived from the available
# data $\mathbf{x}$. The most common and useful risk function is the
# squared error loss,
# $$
# L(\theta,\hat{\theta}) = (\theta-\hat{\theta})^2
# $$
# Although neat, this is not practical because we need to know the
# unknown $\theta$ to compute it. The other problem is that, because $\hat{\theta}$ is
# a function of the observed data, it is also a random variable with its own
# probability density function. This leads to the notion of the *expected risk*
# function,
# $$
# R(\theta,\hat{\theta}) = \mathbb{E}_\theta(L(\theta,\hat{\theta})) = \int L(\theta,\hat{\theta}(\mathbf{x})) f(\mathbf{x};\theta) d \mathbf{x}
# $$
# In other words, given a fixed $\theta$, integrate over the
# probability density function of the data, $f(\mathbf{x})$, to compute the
# risk. Plugging in for the squared error loss, we compute the
# mean squared error,
# $$
# \mathbb{E}_\theta(\theta-\hat{\theta})^2 =\int (\theta-\hat{\theta})^2 f(\mathbf{x};\theta) d \mathbf{x}
# $$
# This has the important factorization into the *bias*,
# $$
# \texttt{bias} = \mathbb{E}_\theta(\hat{\theta})-\theta
# $$
# with the corresponding variance, $\mathbb{V}_\theta(\hat{\theta})$ as
# in the following *mean squared error* (MSE):
# $$
# \mathbb{E}_\theta(\theta-\hat{\theta})^2= \texttt{bias}^2+\mathbb{V}_\theta(\hat{\theta})
# $$
# This is an important trade-off that we will return to repeatedly. The
# idea is that the bias is nonzero when the estimator $\hat{\theta}$, integrated
# over all possible data, $f(\mathbf{x})$, does not equal the underlying target
# parameter $\theta$. In some sense, the estimator misses the target, no matter
# how much data is used. When the bias equals zero, the estimator is *unbiased*.
# For fixed MSE, low bias implies high variance and vice-versa. This trade-off
# was once not emphasized and instead much attention was paid to the smallest
# variance of unbiased estimators (see Cramer-Rao bounds). In practice,
# understanding and exploiting the trade-off between bias and variance and
# reducing the MSE is more important.
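#
# To make this trade-off concrete, the following simulation (a sketch, not part
# of the original text) compares the unbiased sample mean of Bernoulli data
# against a deliberately shrunk estimator. The value $p=1/4$ and the
# $(k+1)/(n+2)$ shrinkage rule are purely illustrative assumptions: the
# shrinkage adds a small bias but reduces the variance enough to lower the
# overall MSE.
# In[ ]:
p_demo, n_demo = 0.25, 50
x_demo = np.random.binomial(1, p_demo, size=(n_demo, 10000)) # 10000 runs of n flips
unbiased = x_demo.mean(axis=0)                 # sample mean: zero bias
shrunk = (x_demo.sum(axis=0)+1.)/(n_demo+2.)   # slightly biased, lower variance
for name, est in (('unbiased', unbiased), ('shrunk', shrunk)):
    print(name, 'bias^2=%.2e' % (est.mean()-p_demo)**2,
          'var=%.2e' % est.var(), 'MSE=%.2e' % np.mean((est-p_demo)**2))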
#
# With all this set up, we can now ask how bad can bad get by
# examining *minimax* risk,
# $$
# R_{\texttt{mmx}} = \inf_{\hat{\theta}} \sup_\theta R(\theta,\hat{\theta})
# $$
# where the $\inf$ is taken over all estimators. Intuitively, this
# means if we found the worst possible $\theta$ and swept over all possible
# parameter estimators $\hat{\theta}$, and then took the smallest possible risk
# we could find, we would have the minimax risk. Thus, an estimator,
# $\hat{\theta}_{\texttt{mmx}}$, is a *minimax estimator* if it achieves this
# feat,
# $$
# \sup_\theta R(\theta,\hat{\theta}_{\texttt{mmx}}) =\inf_{\hat{\theta}} \sup_\theta R(\theta,\hat{\theta})
# $$
# In other words, even in the face of the worst $\theta$ (i.e., the
# $\sup_\theta$), $\hat{\theta}_{\texttt{mmx}}$ still achieves the minimax
# risk. There is a greater theory that revolves around minimax estimators of
# various kinds, but this is far beyond our scope here. The main thing to focus
# on is that under certain technical but easily satisfiable conditions, the
# maximum likelihood estimator is approximately minimax. Maximum likelihood is
# the subject of the next section. Let's get started with the simplest
# application: coin-flipping.
#
# ## Setting up the Coin Flipping Experiment
#
# Suppose we have a coin and want to estimate the probability of heads ($p$) for
# it. We model the distribution of heads and tails as a Bernoulli distribution
# with the following probability mass function:
# $$
# \phi(x)= p^x (1-p)^{(1-x)}
# $$
# where $x$ is the outcome, *1* for heads and *0* for tails. Note that
# maximum likelihood is a parametric method that requires the specification of a
# particular model for which we will compute embedded parameters. For $n$
# independent flips, we have the joint density as the product of $n$ of
# these functions as in,
# $$
# \phi(\mathbf{x})=\prod_{i=1}^n p^{x_i} (1-p)^{(1-x_i)}
# $$
# The following is the *likelihood function*,
# $$
# \mathcal{L}(p ; \mathbf{x})= \prod_{i=1}^n p^{ x_i }(1-p)^{1-x_i}
# $$
# This is basically notation. We have just renamed the
# previous equation to emphasize the $p$ parameter, which is what
# we want to estimate.
#
# The principle of *maximum likelihood* is to maximize the likelihood as the
# function of $p$ after plugging in all of the $x_i$ data. We then call this
# maximizer $\hat{p}$ which is a function of the observed $x_i$ data, and as
# such, is a random variable with its own distribution. This method therefore
# ingests data and an assumed model for the probability density, and produces a
# function that estimates the embedded parameter in the assumed probability
# density. Thus, maximum likelihood generates the *functions* of data that we
# need in order to get at the underlying parameters of the model. Note that there
# is no limit to the ways we can functionally manipulate the data we have
# collected. The maximum likelihood principle gives us a systematic method for
# constructing these functions subject to the assumed model. This is a point
# worth emphasizing: the maximum likelihood principle yields functions as
# solutions the same way solving differential equations yields functions as
# solutions. It is much harder to produce a function than to produce a single
# value as a solution, even with the assumption of a convenient probability
# density. Thus, the power of the principle is that you can construct such
# functions subject to the model assumptions.
#
# ### Simulating the Experiment
#
# We need the following code to simulate coin flipping.
# In[4]:
from scipy.stats import bernoulli
p_true=1/2.0 # estimate this!
fp=bernoulli(p_true) # create bernoulli random variate
xs = fp.rvs(100) # generate some samples
print(xs[:30]) # see first 30 samples
# Now, we can write out the likelihood function using Sympy. Note
# that we give the Sympy variables the `positive=True` attribute upon
# construction because this eases Sympy's internal simplification algorithms.
# In[5]:
import sympy
x,p,z=sympy.symbols('x p z', positive=True)
phi=p**x*(1-p)**(1-x) # distribution function
L=np.prod([phi.subs(x,i) for i in xs]) # likelihood function
print(L) # the symbolic likelihood in terms of p
# Note that, once we plug in the data, the likelihood function is
# solely a function of the unknown parameter ($p$ in this case). The following
# code uses calculus to find the extrema of the likelihood function. Note that
# taking the `log` of $L$ makes the maximization problem tractable but doesn't
# change the extrema.
# In[6]:
logL=sympy.expand_log(sympy.log(L))
sol,=sympy.solve(sympy.diff(logL,p),p)
print(sol)
# **Programming Tip.**
#
# Note that the `sol,=sympy.solve` statement includes
# a comma after the `sol` variable. This is because the `solve`
# function returns a list containing a single element. Using
# this assignment unpacks that single element into the `sol` variable
# directly. This is another one of the many small elegancies of Python.
#
#
#
# The following code generates [Figure](#fig:Maximum_likelihood_10_2).
# In[7]:
fig,ax=subplots()
x=np.linspace(0,1,100)
ax.plot(x,sympy.lambdify(p,logL,'numpy')(x),'k-',lw=3)
ax.plot(float(sol),float(logL.subs(p,sol)),'o',
        color='gray',ms=15,label='Estimated')
ax.plot(p_true,float(logL.subs(p,p_true)),'s',
        color='k',ms=15,label='Actual')
ax.set_xlabel('$p$',fontsize=18)
ax.set_ylabel('Likelihood',fontsize=18)
ax.set_title('Estimate not equal to true value',fontsize=18)
ax.legend(loc=0)
# **Programming Tip.**
#
# In the prior code, we use the `lambdify` function in `lambdify(p,logL,'numpy')` to
# take a Sympy expression and convert it into a Numpy version that is easier to
# compute. The `lambdify` function has an extra argument where you can specify
# the function space that it should use to convert the expression. In the above
# this is set to Numpy.
#
#
# Maximum likelihood estimate vs. true parameter. Note that the estimate is slightly off from the true value. This is a consequence of the fact that the estimator is a function of the data and lacks knowledge of the true underlying value.
#
# [Figure](#fig:Maximum_likelihood_10_2) shows that our estimator $\hat{p}$
# (circle) is not equal to the true value of $p$ (square), despite being
# the maximum of the likelihood function. This may sound disturbing, but keep in
# mind this estimate is a function of the random data; and since that data can
# change, the ultimate estimate can likewise change. I invite you to run this
# code in the corresponding IPython notebook a few times to observe this.
# Remember that the estimator is a *function* of the data and is thus also a
# *random variable*, just like the data is. This means it has its own probability
# distribution with corresponding mean and variance. So, what we are observing is
# a consequence of that variance.
#
#
# Histogram of maximum likelihood estimates. The title shows the estimated mean and standard deviation of the samples.
#
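#
# The code cell that generated this figure is not included in this export, so
# the following is a minimal sketch that reproduces it; the choice of 1000
# experiments with 100 samples each is an assumption for illustration.
# In[ ]:
phat_samples = bernoulli(p_true).rvs((100,1000)).mean(axis=0) # one MLE per column
fig,ax=subplots()
_=ax.hist(phat_samples,bins=20,color='gray')
_=ax.set_title(r'$\mu$=%3.3f, $\sigma$=%3.3f'%(phat_samples.mean(),
                                               phat_samples.std()),fontsize=18)
_=ax.set_xlabel(r'$\hat{p}$',fontsize=18)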
# [Figure](#fig:Maximum_likelihood_30_2) shows what happens when you run many
# thousands of coin experiments and compute the maximum likelihood
# estimate for each experiment, given a particular number of samples
# per experiment. This simulation gives us a histogram of the maximum likelihood
# estimates, which is an approximation of the probability distribution of the
# $\hat{p}$ estimator itself. This figure shows that the sample mean
# of the estimator ($\mu = \frac{1}{n}\sum \hat{p}_i$) is pretty close to the
# true value, but looks can be deceiving. The only way to know for sure is to
# check if the estimator is unbiased, namely, if
# $$
# \mathbb{E}(\hat{p}) = p
# $$
# Because this problem is simple, we can solve for this in general
# by noting that the terms above are either $p$, if $x_i=1$, or $1-p$, if $x_i=0$.
# This means that we can write
# $$
# \mathcal{L}(p\vert \mathbf{x})= p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}
# $$
# with corresponding logarithm as
# $$
# J=\log(\mathcal{L}(p\vert \mathbf{x})) = \log(p) \sum_{i=1}^n x_i + \log(1-p) \left(n-\sum_{i=1}^n x_i\right)
# $$
# Taking the derivative of this gives:
# $$
# \frac{dJ}{dp} = \frac{1}{p}\sum_{i=1}^n x_i + \frac{(n-\sum_{i=1}^n x_i)}{p-1}
# $$
# and solving this for $p$ leads to
# $$
# \hat{p} = \frac{1}{ n} \sum_{i=1}^n x_i
# $$
# This is our *estimator* for $p$. Up until now, we have been using Sympy to
# solve for this based on the data $x_i$ but now that we have it analytically we
# don't have to solve for it each time. To check if this estimator is biased, we
# compute its expectation:
# $$
# \mathbb{E}\left(\hat{p}\right) =\frac{1}{n}\sum_{i=1}^n \mathbb{E}(x_i) = \frac{1}{n} n \mathbb{E}(x_i)
# $$
# by linearity of the expectation and where
# $$
# \mathbb{E}(x_i) = p
# $$
# Therefore,
# $$
# \mathbb{E}\left(\hat{p}\right) =p
# $$
# This means that the estimator is *unbiased*. Similarly,
# $$
# \mathbb{E}\left(\hat{p}^2\right) = \frac{1}{n^2} \mathbb{E}\left[\left( \sum_{i=1}^n x_i \right)^2 \right]
# $$
# where
# $$
# \mathbb{E}\left(x_i^2\right) =p
# $$
# and by the independence assumption,
# $$
# \mathbb{E}\left(x_i x_j\right) =\mathbb{E}(x_i)\mathbb{E}(x_j) =p^2
# $$
# Expanding the square gives $n$ terms of the form $x_i^2$ and $n(n-1)$
# cross-terms $x_i x_j$ with $i\neq j$. Thus,
# $$
# \mathbb{E}\left(\hat{p}^2\right) =\left(\frac{1}{n^2}\right) n \left[ p+(n-1)p^2 \right]
# $$
# So, the variance of the estimator, $\hat{p}$, is the following:
# $$
# \mathbb{V}(\hat{p}) = \mathbb{E}\left(\hat{p}^2\right)- \mathbb{E}\left(\hat{p}\right)^2 = \frac{p(1-p)}{n}
# $$
# Note that the $n$ in the denominator means that the variance
# asymptotically goes to zero as $n$ increases (i.e., we consider more and
# more samples). This is good news because it means that more and
# more coin flips lead to a better estimate of the underlying $p$.
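#
# As a quick numerical check (not in the original text) on both the analytic
# estimator and its variance formula, we can compare the simulated variance of
# many estimates against $p(1-p)/n$; the 2000 repetitions are an arbitrary
# choice, and we can use `p_true` here because this is a simulation.
# In[ ]:
print(np.mean(xs)) # the analytic MLE; matches the earlier sympy solution
n=100
phats = bernoulli(p_true).rvs((n,2000)).mean(axis=0) # 2000 simulated estimates
print(np.var(phats), p_true*(1-p_true)/n) # simulated vs. theoretical variance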
#
# Unfortunately, this formula for the variance is practically useless because we
# need $p$ to compute it and $p$ is the parameter we are trying to estimate in
# the first place! However, this is where the *plug-in* principle [^invariance-property]
# saves the day. It turns out in this situation, you can
# simply substitute the maximum likelihood estimator, $\hat{p}$, for the $p$ in
# the above equation to obtain the asymptotic variance for $\mathbb{V}(\hat{p})$.
# The fact that this works is guaranteed by the asymptotic theory of maximum
# likelihood estimators.
#
# [^invariance-property]: This is also known as the *invariance property*
# of maximum likelihood estimators. It basically states that the
# maximum likelihood estimator of any function, say, $h(\theta)$, is
# the same $h$ with the maximum likelihood estimator for $\theta$ substituted
# in for $\theta$; namely, $h(\theta_{ML})$.
#
# Nevertheless, looking at $\mathbb{V}(\hat{p})$, we can immediately notice
# that if $p=0$, then there is no estimator variance because the outcomes are
# guaranteed to be tails. Also, for any $n$, the maximum of this variance
# happens at $p=1/2$. This is our worst case scenario and the only way to
# compensate is with larger $n$.
#
# All we have computed is the mean and variance of the estimator. In general,
# this is insufficient to characterize the underlying probability density of
# $\hat{p}$, except if we somehow knew that $\hat{p}$ were normally distributed.
# This is where the powerful *Central Limit Theorem*, discussed in the earlier
# section on limit theorems, comes in. The form of the estimator, which is just a
# sample mean, implies that we can apply this theorem and conclude that $\hat{p}$
# is asymptotically normally distributed. However, it doesn't quantify how many
# samples $n$ we need. In our simulation this is no problem because we can
# generate as much data as we like, but in the real world, with a costly
# experiment, each sample may be precious [^edgeworth].
#
# [^edgeworth]: It turns out that the central limit theorem augmented with an
# Edgeworth expansion tells us that convergence is regulated by the skewness
# of the distribution [[feller1950introduction]](#feller1950introduction). In other words, the
# more symmetric the distribution, the faster it converges to the normal
# distribution according to the central limit theorem.
#
# In the following, we won't apply the Central Limit Theorem and instead proceed
# analytically.
#
# ### Probability Density for the Estimator
#
# To write out the full density for $\hat{p}$, we first have to ask what is
# the probability that the estimator will equal a specific value and then tally up
# all the ways that could happen with their corresponding probabilities. For
# example, what is the probability that
# $$
# \hat{p} = \frac{1}{n}\sum_{i=1}^n x_i = 0
# $$
# This can only happen one way: when $x_i=0 \hspace{0.5em} \forall i$. The
# probability of this happening can be computed from the density
# $$
# f(\mathbf{x},p)= \prod_{i=1}^n \left(p^{x_i} (1-p)^{1-x_i} \right)
# $$
# $$
# f\left(\sum_{i=1}^n x_i = 0,p\right)= \left(1-p\right)^n
# $$
# Likewise, if $\lbrace x_i \rbrace$ has only one nonzero element, then
# $$
# f\left(\sum_{i=1}^n x_i = 1,p\right)= n p \left(1-p\right)^{n-1}
# $$
# where the $n$ comes from the $n$ ways to pick one element
# from the $n$ elements $x_i$. Continuing this way, we can construct the
# entire density as
# $$
# f\left(\sum_{i=1}^n x_i = k,p\right)= \binom{n}{k} p^k (1-p)^{n-k}
# $$
# where the first term on the right is the binomial coefficient of $n$ things
# taken $k$ at a time. This is the binomial distribution and it's not the
# density for $\hat{p}$, but rather for $n\hat{p}$. We'll leave this as-is
# because it's easier to work with below. We just have to remember to keep
# track of the $n$ factor.
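#
# We can spot-check this (the following cell is not in the original text) by
# comparing the empirical frequency of one value of $n\hat{p}$, say $\sum x_i =
# 50$, against the binomial probability mass function from Scipy.
# In[ ]:
from scipy.stats import binom
k = bernoulli(p_true).rvs((100,5000)).sum(axis=0) # n*phat for 5000 experiments
print(np.mean(k==50), binom(100,p_true).pmf(50)) # empirical vs. theoretical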
#
# **Confidence Intervals**
#
# Now that we have the full density for $\hat{p}$, we are ready to ask some
# meaningful questions. For example, what is the probability the estimator is within
# $\epsilon$ fraction of the true value of $p$?
# $$
# \mathbb{P}\left( \vert \hat{p}-p \vert \le \epsilon p \right)
# $$
# More concretely, we want to know how often the
# estimated $\hat{p}$ is trapped within $\epsilon$ of the actual value. That is,
# suppose we ran the experiment 1000 times to generate 1000 different estimates
# of $\hat{p}$. What percentage of the 1000 so-computed values are trapped within
# $\epsilon$ of the underlying value? Rewriting the above equation as the
# following,
# $$
# \mathbb{P}\left(p-\epsilon p < \hat{p} < p + \epsilon p \right) = \mathbb{P}\left( n p - n \epsilon p < \sum_{i=1}^n x_i < n p + n \epsilon p \right)
# $$
# Let's plug in some live numbers here for our worst case
# scenario (i.e., highest variance scenario) where $p=1/2$. Then, if
# $\epsilon = 1/100$, we have
# $$
# \mathbb{P}\left( \frac{99 n}{100} < \sum_{i=1}^n x_i < \frac{101 n}{100} \right)
# $$
# Since the sum is integer-valued, we need $n> 100$ to even compute this.
# Thus, if $n=101$ we have,
# $$
# \begin{eqnarray*}
# \mathbb{P}\left(\frac{9999}{200} < \sum_{i=1}^{101} x_i < \frac{10201}{200} \right) &=& f\left(\sum_{i=1}^{101} x_i = 50,p\right) \\
# &=& \binom{101}{50} (1/2)^{50} (1-1/2)^{101-50} \approx 0.079
# \end{eqnarray*}
# $$
# This means that in the worst-case scenario for $p=1/2$, given $n=101$
# trials, we will only get within 1\% of the actual $p=1/2$ about 8\% of the
# time. If you feel disappointed, that only means you've been paying attention.
# What if the coin was really heavy and it was hard work to repeat this 101 times?
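#
# We can verify the 0.079 value directly with Scipy's binomial distribution;
# this quick check is not in the original text.
# In[ ]:
from scipy.stats import binom
print(binom(101,0.5).pmf(50)) # approx 0.079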
#
# Let's come at this another way: given I could only flip the coin 100
# times, how close could I come to the true underlying value with high
# probability (say, 95\%)? In this case, instead of picking a value for
# $\epsilon$, we are solving for $\epsilon$. Plugging in gives,
# $$
# \mathbb{P}\left(50 - 50\epsilon < \sum_{i=1}^{100} x_i < 50 + 50 \epsilon \right) = 0.95
# $$
# which we have to solve for $\epsilon$. Fortunately, all the tools we
# need to solve for this are already in Scipy.
# In[8]:
from scipy.stats import binom
# n=100, p=0.5, binomial distribution of the sum (i.e., n*phat)
b=binom(100,.5)
# sum the probability mass in a band around the mean
g = lambda i:b.pmf(np.arange(-i,i)+50).sum()
print(g(10)) # approx 0.95
# In[9]:
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots, arange
fig,ax= subplots()
fig.set_size_inches((10,5))
# here is the density of the sum of x_i
_=ax.stem(arange(0,101),b.pmf(arange(0,101)),
linefmt='k-', markerfmt='ko')
_=ax.vlines( [50+10,50-10],0 ,ax.get_ylim()[1] ,color='k',lw=3.)
_=ax.axis(xmin=30,xmax=70)
_=ax.tick_params(labelsize=18)
#fig.savefig('fig-statistics/Maximum_likelihood_20_2.png')
fig.tight_layout()
#
#
# Probability mass function for $\hat{p}$. The two vertical lines form the confidence interval.
#
# The two vertical lines in the plot show how far out from the mean we
# have to go to accumulate 95\% of the probability. Now, we can solve this as
# $$
# 50+50\epsilon=60
# $$
# which makes $\epsilon=1/5$ or 20\%. So, flipping 100 times means I can
# only get within 20\% of the real $p$ 95\% of the time in the worst case
# scenario (i.e., $p=1/2$). The following code verifies the situation.
# In[10]:
from scipy.stats import bernoulli
b=bernoulli(0.5) # coin distribution
xs = b.rvs(100) # flip it 100 times
phat = np.mean(xs) # estimated p
print(abs(phat-0.5) < 0.5*0.20) # was it within the interval?
# Let's keep doing this and see if we can get within this interval 95\% of
# the time.
# In[11]:
out=[]
b=bernoulli(0.5) # coin distribution
for i in range(500): # number of tries
    xs = b.rvs(100) # flip it 100 times
    phat = np.mean(xs) # estimated p
    out.append(abs(phat-0.5) < 0.5*0.20) # within 20%?
# percentage of tries w/in 20% interval
print(100*np.mean(out))
# Well, that seems to work! Now we have a way to get at the quality of
# the estimator, $\hat{p}$.
#
# **Maximum Likelihood Estimator Without Calculus**
#
# The prior example showed how we can use calculus to compute the maximum
# likelihood estimator. It's important to emphasize that the maximum likelihood
# principle does *not* depend on calculus and extends to more general situations
# where calculus is impossible. For example, let $X$ be uniformly distributed in
# the interval $[0,\theta]$. Given $n$ measurements of $X$, the likelihood
# function is the following:
# $$
# L(\theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}
# $$
# where each $x_i \in [0,\theta]$. Note that the slope of this function
# is not zero anywhere so the usual calculus approach is not going to work here.
# Because the likelihood is the product of the individual uniform densities, if
# any of the $x_i$ values were outside of the proposed $[0,\theta]$ interval,
# then the likelihood would go to zero, because the uniform density is zero
# outside of the $[0,\theta]$. Naturally, this is no good for maximization. Thus,
# observing that the likelihood function is strictly decreasing with increasing
# $\theta$, we conclude that the value for $\theta$ that maximizes the likelihood
# is the maximum of the $x_i$ values. To summarize, the maximum likelihood
# estimator is the following:
# $$
# \theta_{ML} = \max_i x_i
# $$
# As always, we want the distribution of this estimator to judge its
# performance. In this case, this is pretty straightforward. The cumulative
# distribution function for the $\max$ function is the following:
# $$
# \mathbb{P} \left( \hat{\theta}_{ML} < v \right) = \mathbb{P}( x_1 \leq v \wedge x_2 \leq v \wedge \ldots \wedge x_n \leq v)
# $$
# and since all the $x_i$ are uniformly distributed in $[0,\theta]$, we have
# $$
# \mathbb{P} \left( \hat{\theta}_{ML} < v \right) = \left(\frac{v}{\theta}\right)^n
# $$
# So, the probability density function is then,
# $$
# f_{\hat{\theta}_{ML}}(\theta_{ML}) = n \theta_{ML}^{ n-1 } \theta^{ -n }
# $$
# Then, we can compute $\mathbb{E}(\theta_{ML}) = \frac{n\theta}{n+1}$ with the
# corresponding variance $\mathbb{V}(\theta_{ML}) = \frac{n\theta^2}{(n+1)^2(n+2)}$.
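#
# We can verify these two moments symbolically (a sketch, not in the original
# text) by integrating against the density derived above.
# In[ ]:
import sympy as S
th,v,n = S.symbols('theta v n',positive=True)
f = n*v**(n-1)/th**n                  # density of the max estimator from above
EV = S.integrate(v*f,(v,0,th))        # mean: expect n*theta/(n+1)
EV2 = S.integrate(v**2*f,(v,0,th))    # second moment
print(S.simplify(EV), S.factor(EV2-EV**2)) # variance: n*theta**2/((n+1)**2*(n+2))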
#
# For a quick sanity check, we can write the following simulation for
# $\theta =1$:
# In[12]:
from scipy import stats
rv = stats.uniform(0,1) # define uniform random variable
mle=rv.rvs((100,500)).max(0) # max over rows (axis=0), one per column
print(np.mean(mle)) # approx n/(n+1) = 100/101 ~= 0.99
print(np.var(mle)) # approx n/(n+1)**2/(n+2) ~= 9.61E-5
# **Programming Tip.**
#
# The `max(0)` suffix on the `mle` computation takes
# the maximum of the so-computed array along the `axis=0`
# dimension, leaving one maximum per column.
#
#
#
# You can also plot `hist(mle)` to see the histogram of the simulated
# maximum likelihood estimates and match it up against the probability density
# function we derived above.
#
#
# In this section, we explored the concept of maximum
# likelihood estimation using a coin flipping experiment both analytically and
# numerically with the scientific Python stack. We also explored the case when
# calculus is not workable for maximum likelihood estimation. There are two key
# points to remember. First, maximum likelihood estimation produces a function of
# the data that is itself a random variable, with its own probability
# distribution. We can get at the quality of the so-derived estimators by
# examining the confidence intervals around the estimated values using the
# probability distributions associated with the estimators themselves.
# Second, maximum likelihood estimation applies even in situations
# where using basic calculus is not applicable [[wasserman2004all]](#wasserman2004all).
#
#
# ## Delta Method
#
#
# The Central Limit Theorem provides a way to get at the distribution of a random
# variable. However, sometimes we are more interested in a function of the random
# variable. In order to extend and generalize the central limit theorem in this
# way, we need the Taylor series expansion. Recall that the Taylor series
# expansion is an approximation of a function of the following form,
# $$
# T_r(x) =\sum_{i=0}^r \frac{g^{(i)}(a)}{i!}(x-a)^i
# $$
# this basically says that a function $g$ can be adequately
# approximated about a point $a$ using a polynomial based on its derivatives
# evaluated at $a$. Before we state the general theorem, let's examine
# an example to understand how the mechanics work.
#
# **Example.** Suppose that $X$ is a random variable with
# $\mathbb{E}(X)=\mu\neq 0$. Furthermore, suppose we have a suitable
# function $g$ and we want the distribution of $g(X)$. Applying the
# Taylor series expansion, we obtain the following,
# $$
# g(X) \approx g(\mu)+ g^{\prime}(\mu)(X-\mu)
# $$
# If we use $g(X)$ as an estimator for $g(\mu)$, then we can say that
# we approximately have the following
# $$
# \begin{align*}
# \mathbb{E}(g(X)) &=g(\mu) \\
# \mathbb{V}(g(X)) &=(g^{\prime}(\mu))^2 \mathbb{V}(X) \\
# \end{align*}
# $$
# Concretely, suppose we want to estimate the odds, $\frac{p}{1-p}$.
# For example, if $p=2/3$, then we say that the odds is `2:1`, meaning that one
# outcome is twice as likely as the other.
# Thus, we have $g(p)=\frac{p}{1-p}$ and we want to find
# $\mathbb{V}(g(\hat{p}))$. In our coin-flipping problem, we have the
# estimator $\hat{p}=\frac{1}{n}\sum X_k$ from the Bernoulli-distributed data
# $X_k$ of individual coin flips. Thus,
# $$
# \begin{align*}
# \mathbb{E}(\hat{p}) &= p \\
# \mathbb{V}(\hat{p}) &= \frac{p(1-p)}{n} \\
# \end{align*}
# $$
# Now, $g^\prime(p)=1/(1-p)^2$, so we have,
# $$
# \begin{align*}
# \mathbb{V}(g(\hat{p}))&=(g^\prime(p))^2 \mathbb{V}(\hat{p}) \\
# &=\left(\frac{1}{(1-p)^2}\right)^2 \frac{p(1-p)}{n} \\
# &= \frac{p}{n(1-p)^3} \\
# \end{align*}
# $$
# which is an approximation of the variance of the estimator
# $g(\hat{p})$. Let's simulate this and see how it agrees.
# In[13]:
from scipy import stats
# compute MLE estimates
d=stats.bernoulli(0.1).rvs((10,5000)).mean(0)
# avoid divide-by-zero
d=d[np.logical_not(np.isclose(d,1))]
# compute odds ratio
odds = d/(1-d)
print('odds ratio=',np.mean(odds),'var=',np.var(odds))
# The first number above is the mean of the simulated odds
# ratio and the second is the variance of the estimate. According to
# the variance estimate above, we have $\mathbb{V}(g(1/10))\approx
# 0.0137$, which is not too bad for this approximation. Recall we want
# to estimate the odds from the $\hat{p}$. The code above takes $5000$
# estimates of the $\hat{p}$ to estimate $\mathbb{V}(g)$. The odds ratio
# for $p=1/10$ is $1/9\approx 0.111$.
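#
# The approximation quoted above comes from plugging $p=1/10$ and $n=10$ into
# the delta-method formula; this arithmetic check is not in the original text.
# In[ ]:
print(0.1/(10*(1-0.1)**3)) # V(g(1/10)) = p/(n*(1-p)**3) approx 0.0137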
#
# **Programming Tip.**
#
# The code above uses the `np.isclose` function to identify the ones from
# the simulation and the `np.logical_not` removes these elements from the
# data because the odds ratio has a zero in the denominator
# for these values.
#
#
#
# Let's try this again with a probability of heads of `0.5` instead of
# `0.1`.
# In[14]:
from scipy import stats
d=stats.bernoulli(0.5).rvs((10,5000)).mean(0)
d=d[np.logical_not(np.isclose(d,1))] # avoid divide-by-zero
odds = d/(1-d) # compute odds ratio
print('odds ratio=',np.mean(odds),'var=',np.var(odds))
# The true odds ratio in this case is equal to one, which
# is not close to what was reported. According to our
# approximation, we have $\mathbb{V}(g)=0.4$, which does not
# look like what our simulation just reported. This is
# because the approximation is best when the odds ratio is
# nearly linear and worse otherwise.
#
#
# The odds ratio is close to linear for small values but becomes unbounded as $p$ approaches one. The delta method is more effective for small underlying values of $p$, where the linear approximation is better.