#!/usr/bin/env python
# coding: utf-8

# In[1]:

from IPython.display import Image
Image('../../../python_for_probability_statistics_and_machine_learning.jpg')

# [Python for Probability, Statistics, and Machine Learning](https://www.springer.com/fr/book/9783319307152)

# In[2]:

from __future__ import division
import numpy as np
np.random.seed(1234)

# In[3]:

get_ipython().run_line_magic('pylab', 'inline')

# The estimation problem starts with the desire to infer something meaningful from data. For parametric estimation, the strategy is to postulate a model for the data and then use the data to fit model parameters. This leads to two fundamental questions: where to get the model and how to estimate the parameters? The first question is best answered by the maxim: *all models are wrong, some are useful*. In other words, choosing a model depends as much on the application as on the model itself. Think about models as building different telescopes to view the sky. No one would ever claim that the telescope generates the sky! It is the same with data models. Models give us multiple perspectives on the data that themselves are proxies for some deeper underlying phenomenon.
#
# Some categories of data may be more commonly studied using certain types of models, but this is usually very domain-specific and ultimately depends on the aims of the analysis. In some cases, there may be strong physical reasons behind choosing a model. For example, one could postulate that the model is linear with some noise as in the following:
# $$
# Y = a X + \epsilon
# $$
# which basically says that you, as the experimenter, dial in some value for $X$ and then read off something directly proportional to $X$ as the measurement, $Y$, plus some additive noise that you attribute to jitter in the apparatus. Then, the next step is to estimate the parameter $a$ in the model, given some postulated claim about the nature of $\epsilon$. How to compute the model parameters depends on the particular methodology. The two broad rubrics are parametric and non-parametric estimation. In the former, we assume we know the density function of the data and then try to derive the embedded parameters for it. In the latter, we claim only to know that the density function is a member of a broad class of density functions and then use the data to characterize a member of that class. Broadly speaking, the former consumes less data than the latter, because there are fewer unknowns to compute from the data.
#
# Let's concentrate on parametric estimation for now. The tradition is to denote the unknown parameter to be estimated as $\theta$, which is a member of a large space of alternates, $\Theta$. To judge between potential $\theta$ values, we need an objective function, known as a *risk* function, $L(\theta,\hat{\theta})$, where $\hat{\theta}(\mathbf{x})$ is an estimate for the unknown $\theta$ that is derived from the available data $\mathbf{x}$. The most common and useful risk function is the squared error loss,
# $$
# L(\theta,\hat{\theta}) = (\theta-\hat{\theta})^2
# $$
# Although neat, this is not practical because we need to know the unknown $\theta$ to compute it. The other problem is that, because $\hat{\theta}$ is a function of the observed data, it is also a random variable with its own probability density function.
# This leads to the notion of the *expected risk* function,
# $$
# R(\theta,\hat{\theta}) = \mathbb{E}_\theta(L(\theta,\hat{\theta})) = \int L(\theta,\hat{\theta}(\mathbf{x})) f(\mathbf{x};\theta) d\mathbf{x}
# $$
# In other words, given a fixed $\theta$, integrate over the probability density function of the data, $f(\mathbf{x};\theta)$, to compute the risk. Plugging in the squared error loss, we compute the mean squared error,
# $$
# \mathbb{E}_\theta(\theta-\hat{\theta})^2 = \int (\theta-\hat{\theta})^2 f(\mathbf{x};\theta) d\mathbf{x}
# $$
# This has an important factorization into the *bias*,
# $$
# \texttt{bias} = \mathbb{E}_\theta(\hat{\theta})-\theta
# $$
# with the corresponding variance, $\mathbb{V}_\theta(\hat{\theta})$, as in the following *mean squared error* (MSE):
# $$
# \mathbb{E}_\theta(\theta-\hat{\theta})^2 = \texttt{bias}^2+\mathbb{V}_\theta(\hat{\theta})
# $$
# This is an important trade-off that we will return to repeatedly. The idea is that the bias is nonzero when the estimator $\hat{\theta}$, integrated over all possible data $f(\mathbf{x};\theta)$, does not equal the underlying target parameter $\theta$. In some sense, the estimator misses the target, no matter how much data is used. When the bias equals zero, the estimator is *unbiased*. For fixed MSE, low bias implies high variance and vice-versa. This trade-off was once not emphasized and instead much attention was paid to the smallest variance of unbiased estimators (see Cramer-Rao bounds). In practice, understanding and exploiting the trade-off between bias and variance and reducing the MSE is more important.
#
# With all this set up, we can now ask how bad can bad get by examining the *minimax* risk,
# $$
# R_{\texttt{mmx}} = \inf_{\hat{\theta}} \sup_\theta R(\theta,\hat{\theta})
# $$
# where the $\inf$ is taken over all estimators. Intuitively, this means that if we found the worst possible $\theta$ and swept over all possible parameter estimators $\hat{\theta}$, and then took the smallest possible risk we could find, we would have the minimax risk. Thus, an estimator, $\hat{\theta}_{\texttt{mmx}}$, is a *minimax estimator* if it achieves this feat,
# $$
# \sup_\theta R(\theta,\hat{\theta}_{\texttt{mmx}}) = \inf_{\hat{\theta}} \sup_\theta R(\theta,\hat{\theta})
# $$
# In other words, even in the face of the worst $\theta$ (i.e., the $\sup_\theta$), $\hat{\theta}_{\texttt{mmx}}$ still achieves the minimax risk. There is a greater theory that revolves around minimax estimators of various kinds, but this is far beyond our scope here. The main thing to focus on is that under certain technical but easily satisfiable conditions, the maximum likelihood estimator is approximately minimax. Maximum likelihood is the subject of the next section. Let's get started with the simplest application: coin flipping.
#
# ## Setting up the Coin Flipping Experiment
#
# Suppose we have a coin and want to estimate the probability of heads ($p$) for it. We model the distribution of heads and tails as a Bernoulli distribution with the following probability mass function:
# $$
# \phi(x) = p^x (1-p)^{(1-x)}
# $$
# where $x$ is the outcome, *1* for heads and *0* for tails. Note that maximum likelihood is a parametric method that requires the specification of a particular model for which we will compute embedded parameters.
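# To make the model concrete before moving on, here is a quick sanity check of the pmf above against `scipy.stats.bernoulli` (a sketch; the value `0.3` is just a hypothetical heads probability).

# In[ ]:

from scipy.stats import bernoulli
p_demo = 0.3              # hypothetical heads probability
rv = bernoulli(p_demo)
print(rv.pmf(1))          # phi(1) = p     -> 0.3
print(rv.pmf(0))          # phi(0) = 1 - p -> 0.7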
# For $n$ independent flips, we have the joint density as the product of $n$ of these functions as in,
# $$
# \phi(\mathbf{x}) = \prod_{i=1}^n p^{x_i} (1-p)^{(1-x_i)}
# $$
# The following is the *likelihood function*,
# $$
# \mathcal{L}(p ; \mathbf{x}) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}
# $$
# This is basically notation. We have just renamed the previous equation to emphasize the $p$ parameter, which is what we want to estimate.
#
# The principle of *maximum likelihood* is to maximize the likelihood as a function of $p$ after plugging in all of the $x_i$ data. We then call this maximizer $\hat{p}$, which is a function of the observed $x_i$ data, and as such, is a random variable with its own distribution. This method therefore ingests data and an assumed model for the probability density, and produces a function that estimates the embedded parameter in the assumed probability density. Thus, maximum likelihood generates the *functions* of data that we need in order to get at the underlying parameters of the model. Note that there is no limit to the ways we can functionally manipulate the data we have collected. The maximum likelihood principle gives us a systematic method for constructing these functions subject to the assumed model. This is a point worth emphasizing: the maximum likelihood principle yields functions as solutions the same way solving differential equations yields functions as solutions. It is very, very much harder to produce a function than to produce a value as a solution, even with the assumption of a convenient probability density. Thus, the power of the principle is that you can construct such functions subject to the model assumptions.
#
# ### Simulating the Experiment
#
# We need the following code to simulate coin flipping.

# In[4]:

from scipy.stats import bernoulli
p_true = 1/2.0         # estimate this!
fp = bernoulli(p_true) # create bernoulli random variate
xs = fp.rvs(100)       # generate some samples
print(xs[:30])         # see first 30 samples

# Now, we can write out the likelihood function using Sympy. Note that we give the Sympy variables the `positive=True` attribute upon construction because this eases Sympy's internal simplification algorithms.

# In[5]:

import sympy
x,p,z = sympy.symbols('x p z', positive=True)
phi = p**x*(1-p)**(1-x)                  # distribution function
L = np.prod([phi.subs(x,i) for i in xs]) # likelihood function
print(L)                                 # powers of p and (1-p)

# Note that, once we plug in the data, the likelihood function is solely a function of the unknown parameter ($p$ in this case). The following code uses calculus to find the extrema of the likelihood function. Note that taking the `log` of $L$ makes the maximization problem tractable but doesn't change the extrema.

# In[6]:

logL = sympy.expand_log(sympy.log(L))
sol, = sympy.solve(sympy.diff(logL,p),p)
print(sol)  # approx 0.5?

# **Programming Tip.**
#
# Note that the `sol,=sympy.solve` statement includes a comma after the `sol` variable. This is because the `solve` function returns a list containing a single element. Using this assignment unpacks that single element into the `sol` variable directly. This is another one of the many small elegancies of Python.
#
# The following code generates [Figure](#fig:Maximum_likelihood_10_2).
# In[7]:

fig,ax = subplots()
x = np.linspace(0.01,0.99,100)  # avoid log(0) at the endpoints
ax.plot(x, sympy.lambdify(p,logL,'numpy')(x), 'k-', lw=3)
ax.plot(sol, logL.subs(p,sol), 'o', color='gray', ms=15, label='Estimated')
ax.plot(p_true, logL.subs(p,p_true), 's', color='k', ms=15, label='Actual')
ax.set_xlabel('$p$',fontsize=18)
ax.set_ylabel('Log-likelihood',fontsize=18)
ax.set_title('Estimate not equal to true value',fontsize=18)
ax.legend(loc=0)

# **Programming Tip.**
#
# In the prior code, we use the `lambdify` function in `lambdify(p,logL,'numpy')` to take a Sympy expression and convert it into a Numpy version that is easier to compute. The `lambdify` function has an extra argument where you can specify the function space that it should use to convert the expression. In the above this is set to Numpy.
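# As a quick standalone illustration of `lambdify` (a sketch; the expression `t**2 + 1` is just a toy example), note how the converted function operates directly on Numpy arrays:

# In[ ]:

import sympy
t = sympy.symbols('t', positive=True)
f = sympy.lambdify(t, t**2 + 1, 'numpy')  # numpy-backed callable
print(f(np.array([1., 2., 3.])))          # -> [  2.   5.  10.]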
# <div id="fig:Maximum_likelihood_10_2"></div>
#
# **Figure:** Maximum likelihood estimate vs. true parameter. Note that the estimate is slightly off from the true value. This is a consequence of the fact that the estimator is a function of the data and lacks knowledge of the true underlying value.
# [Figure](#fig:Maximum_likelihood_10_2) shows that our estimator $\hat{p}$ (circle) is not equal to the true value of $p$ (square), despite being the maximum of the likelihood function. This may sound disturbing, but keep in mind this estimate is a function of the random data; and since that data can change, the ultimate estimate can likewise change. I invite you to run this code in the corresponding IPython notebook a few times to observe this. Remember that the estimator is a *function* of the data and is thus also a *random variable*, just like the data is. This means it has its own probability distribution with corresponding mean and variance. So, what we are observing is a consequence of that variance.
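# To see this variability directly, the following sketch repeats the experiment a few times with fresh data, using the sample mean as the estimate (anticipating the analytical result derived below).

# In[ ]:

for _ in range(3):
    xs_new = fp.rvs(100)   # fresh batch of 100 flips
    print(xs_new.mean())   # the estimate changes from run to run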
# <div id="fig:Maximum_likelihood_30_2"></div>
#
# **Figure:** Histogram of maximum likelihood estimates. The title shows the estimated mean and standard deviation of the samples.
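# The code that generated this histogram is not shown above; the following is a minimal sketch that reproduces it (the counts of 2000 experiments with 100 flips each are illustrative assumptions).

# In[ ]:

from scipy.stats import bernoulli
from matplotlib.pylab import subplots
phat_samples = bernoulli(0.5).rvs((100,2000)).mean(axis=0)  # one estimate per experiment
fig, ax = subplots()
ax.hist(phat_samples, bins=30, color='gray')
ax.set_title('mean=%.4f, std=%.4f' % (phat_samples.mean(), phat_samples.std()))
ax.set_xlabel(r'$\hat{p}$', fontsize=18)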
# [Figure](#fig:Maximum_likelihood_30_2) shows what happens when you run many thousands of coin experiments and compute the maximum likelihood estimate for each experiment, given a particular number of samples per experiment. This simulation gives us a histogram of the maximum likelihood estimates, which is an approximation of the probability distribution of the $\hat{p}$ estimator itself. This figure shows that the sample mean of the estimator ($\mu = \frac{1}{m}\sum_{i=1}^m \hat{p}_i$, averaged over the $m$ separate experiments) is pretty close to the true value, but looks can be deceiving. The only way to know for sure is to check if the estimator is unbiased, namely, if
# $$
# \mathbb{E}(\hat{p}) = p
# $$
# Because this problem is simple, we can solve for this in general by noting that each factor in the likelihood is either $p$ (if $x_i=1$) or $1-p$ (if $x_i=0$). This means that we can write
# $$
# \mathcal{L}(p;\mathbf{x}) = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}
# $$
# with corresponding logarithm as
# $$
# J = \log(\mathcal{L}(p;\mathbf{x})) = \log(p) \sum_{i=1}^n x_i + \log(1-p) \left(n-\sum_{i=1}^n x_i\right)
# $$
# Taking the derivative of this gives:
# $$
# \frac{dJ}{dp} = \frac{1}{p}\sum_{i=1}^n x_i + \frac{(n-\sum_{i=1}^n x_i)}{p-1}
# $$
# Setting this to zero and solving for $p$ leads to
# $$
# \hat{p} = \frac{1}{n} \sum_{i=1}^n x_i
# $$
# This is our *estimator* for $p$. Up until now, we have been using Sympy to solve for this based on the data $x_i$, but now that we have it analytically we don't have to solve for it each time. To check if this estimator is biased, we compute its expectation:
# $$
# \mathbb{E}\left(\hat{p}\right) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(x_i) = \frac{1}{n} n \mathbb{E}(x_i)
# $$
# by linearity of the expectation and where
# $$
# \mathbb{E}(x_i) = p
# $$
# Therefore,
# $$
# \mathbb{E}\left(\hat{p}\right) = p
# $$
# This means that the estimator is *unbiased*. Similarly,
# $$
# \mathbb{E}\left(\hat{p}^2\right) = \frac{1}{n^2} \mathbb{E}\left[\left( \sum_{i=1}^n x_i \right)^2 \right]
# $$
# and where
# $$
# \mathbb{E}\left(x_i^2\right) = p
# $$
# and by the independence assumption, for $i \neq j$,
# $$
# \mathbb{E}\left(x_i x_j\right) = \mathbb{E}(x_i)\mathbb{E}(x_j) = p^2
# $$
# Thus,
# $$
# \mathbb{E}\left(\hat{p}^2\right) = \left(\frac{1}{n^2}\right) n \left[ p+(n-1)p^2 \right]
# $$
# So, the variance of the estimator, $\hat{p}$, is the following:
# $$
# \mathbb{V}(\hat{p}) = \mathbb{E}\left(\hat{p}^2\right) - \mathbb{E}\left(\hat{p}\right)^2 = \frac{p(1-p)}{n}
# $$
# Note that the $n$ in the denominator means that the variance asymptotically goes to zero as $n$ increases (i.e., as we consider more and more samples). This is good news because it means that more and more coin flips lead to a better estimate of the underlying $p$.
#
# Unfortunately, this formula for the variance is practically useless because we need $p$ to compute it and $p$ is the parameter we are trying to estimate in the first place! However, this is where the *plug-in* principle [^invariance-property] saves the day. It turns out in this situation, you can simply substitute the maximum likelihood estimator, $\hat{p}$, for the $p$ in the above equation to obtain the asymptotic variance for $\mathbb{V}(\hat{p})$. The fact that this works is guaranteed by the asymptotic theory of maximum likelihood estimators.
#
# [^invariance-property]: This is also known as the *invariance property* of maximum likelihood estimators.
# It basically states that the maximum likelihood estimator of any function, say, $h(\theta)$, is the same $h$ with the maximum likelihood estimator for $\theta$ substituted in for $\theta$; namely, $h(\theta_{ML})$.
#
# Nevertheless, looking at $\mathbb{V}(\hat{p})$, we can immediately notice that if $p=0$, then there is no estimator variance because the outcomes are guaranteed to be tails. Also, for any $n$, the maximum of this variance happens at $p=1/2$. This is our worst case scenario and the only way to compensate is with larger $n$.
#
# All we have computed is the mean and variance of the estimator. In general, this is insufficient to characterize the underlying probability density of $\hat{p}$, except if we somehow knew that $\hat{p}$ were normally distributed. This is where the powerful *Central Limit Theorem* we discussed earlier comes in. The form of the estimator, which is just a sample mean, implies that we can apply this theorem and conclude that $\hat{p}$ is asymptotically normally distributed. However, it doesn't quantify how many samples $n$ we need. In our simulation this is no problem because we can generate as much data as we like, but in the real world, with a costly experiment, each sample may be precious [^edgeworth].
#
# [^edgeworth]: It turns out that the central limit theorem augmented with an Edgeworth expansion tells us that convergence is regulated by the skewness of the distribution [[feller1950introduction]](#feller1950introduction). In other words, the more symmetric the distribution, the faster it converges to the normal distribution according to the central limit theorem.
#
# In the following, we won't apply the Central Limit Theorem and instead proceed analytically.
#
# ### Probability Density for the Estimator
#
# To write out the full density for $\hat{p}$, we first have to ask what is the probability that the estimator will equal a specific value, and then tally up all the ways that could happen, with their corresponding probabilities. For example, what is the probability that
# $$
# \hat{p} = \frac{1}{n}\sum_{i=1}^n x_i = 0
# $$
# This can only happen one way: when $x_i=0 \hspace{0.5em} \forall i$. The probability of this happening can be computed from the density
# $$
# f(\mathbf{x},p) = \prod_{i=1}^n \left(p^{x_i} (1-p)^{1-x_i} \right)
# $$
# so that
# $$
# f\left(\sum_{i=1}^n x_i = 0,p\right) = \left(1-p\right)^n
# $$
# Likewise, if $\lbrace x_i \rbrace$ has only one nonzero element, then
# $$
# f\left(\sum_{i=1}^n x_i = 1,p\right) = n p \left(1-p\right)^{n-1}
# $$
# where the $n$ comes from the $n$ ways to pick the one nonzero element from the $n$ elements $x_i$. Continuing this way, we can construct the entire density as
# $$
# f\left(\sum_{i=1}^n x_i = k,p\right) = \binom{n}{k} p^k (1-p)^{n-k}
# $$
# where the first term on the right is the binomial coefficient of $n$ things taken $k$ at a time. This is the binomial distribution and it's not the density for $\hat{p}$, but rather for $n\hat{p}$. We'll leave this as-is because it's easier to work with below. We just have to remember to keep track of the $n$ factor.
#
# **Confidence Intervals**
#
# Now that we have the full density for $\hat{p}$, we are ready to ask some meaningful questions. For example, what is the probability that the estimator is within an $\epsilon$ fraction of the true value of $p$?
# $$
# \mathbb{P}\left( \vert \hat{p}-p \vert \le \epsilon p \right)
# $$
# More concretely, we want to know how often the estimated $\hat{p}$ is trapped within $\epsilon$ of the actual value. That is, suppose we ran the experiment 1000 times to generate 1000 different estimates of $\hat{p}$. What percentage of the 1000 so-computed values would be trapped within $\epsilon$ of the underlying value? Rewriting the above equation gives the following,
# $$
# \mathbb{P}\left(p-\epsilon p < \hat{p} < p + \epsilon p \right) = \mathbb{P}\left( n p - n \epsilon p < \sum_{i=1}^n x_i < n p + n \epsilon p \right)
# $$
# Let's plug in some live numbers here for our worst case scenario (i.e., highest variance scenario) where $p=1/2$. Then, if $\epsilon = 1/100$, we have
# $$
# \mathbb{P}\left( \frac{99 n}{200} < \sum_{i=1}^n x_i < \frac{101 n}{200} \right)
# $$
# Since the sum is integer-valued, we need $n > 100$ to even compute this. Thus, if $n=101$ we have,
# $$
# \mathbb{P}\left(\frac{9999}{200} < \sum_{i=1}^{101} x_i < \frac{10201}{200} \right) = f\left(\sum_{i=1}^{101} x_i = 50,p\right) + f\left(\sum_{i=1}^{101} x_i = 51,p\right) = 2\binom{101}{50} \left(\frac{1}{2}\right)^{101} \approx 0.158
# $$
# because only the integers $50$ and $51$ fall inside the interval $(49.995, 51.005)$ and, by the symmetry $\binom{101}{50}=\binom{101}{51}$, each contributes about $0.079$. This means that in the worst-case scenario for $p=1/2$, given $n=101$ trials, we will only get within 1\% of the actual $p=1/2$ about 16\% of the time. If you feel disappointed, that only means you've been paying attention. What if the coin was really heavy and it was hard work to repeat this 101 times?
#
# Let's come at this another way: given that I could only flip the coin 100 times, how close could I come to the true underlying value with high probability (say, 95\%)? In this case, instead of picking a value for $\epsilon$, we are solving for $\epsilon$. Plugging in gives,
# $$
# \mathbb{P}\left(50 - 50\epsilon < \sum_{i=1}^{100} x_i < 50 + 50 \epsilon \right) = 0.95
# $$
# which we have to solve for $\epsilon$. Fortunately, all the tools we need to solve for this are already in Scipy.

# In[8]:

from scipy.stats import binom
# n=100, p=0.5, distribution of the sum n*phat
b = binom(100,.5)
# sum the probability mass around the mean
g = lambda i: b.pmf(np.arange(-i,i)+50).sum()
print(g(10))  # approx 0.95

# In[9]:

get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots, arange
fig,ax = subplots()
fig.set_size_inches((10,5))
# here is the density of the sum of x_i
_ = ax.stem(arange(0,101), b.pmf(arange(0,101)), linefmt='k-', markerfmt='ko')
_ = ax.vlines([50+10,50-10], 0, ax.get_ylim()[1], color='k', lw=3.)
_ = ax.axis(xmin=30,xmax=70)
_ = ax.tick_params(labelsize=18)
#fig.savefig('fig-statistics/Maximum_likelihood_20_2.png')
fig.tight_layout()
# <div id="fig:Maximum_likelihood_20_2"></div>
#
# **Figure:** Probability mass function for $n\hat{p}$. The two vertical lines form the confidence interval.
# The two vertical lines in the plot show how far out from the mean we have to go to accumulate 95\% of the probability. Now, we can solve this as
# $$
# 50+50\epsilon = 60
# $$
# which makes $\epsilon=1/5$ or 20\%. So, flipping 100 times means I can only get within 20\% of the real $p$ 95\% of the time in the worst case scenario (i.e., $p=1/2$). The following code verifies the situation.

# In[10]:

from scipy.stats import bernoulli
b = bernoulli(0.5)  # coin distribution
xs = b.rvs(100)     # flip it 100 times
phat = np.mean(xs)  # estimated p
print(abs(phat-0.5) < 0.5*0.20)  # is it within the interval?

# Let's keep doing this and see if we can get within this interval 95\% of the time.

# In[11]:

out = []
b = bernoulli(0.5)                        # coin distribution
for i in range(500):                      # number of tries
    xs = b.rvs(100)                       # flip it 100 times
    phat = np.mean(xs)                    # estimated p
    out.append(abs(phat-0.5) < 0.5*0.20)  # within 20%?

# percentage of tries w/in 20% interval
print(100*np.mean(out))

# Well, that seems to work! Now we have a way to get at the quality of the estimator, $\hat{p}$.
#
# **Maximum Likelihood Estimator Without Calculus**
#
# The prior example showed how we can use calculus to compute the maximum likelihood estimator. It's important to emphasize that the maximum likelihood principle does *not* depend on calculus and extends to more general situations where calculus is impossible. For example, let $X$ be uniformly distributed in the interval $[0,\theta]$. Given $n$ measurements of $X$, the likelihood function is the following:
# $$
# L(\theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}
# $$
# where each $x_i \in [0,\theta]$. Note that the slope of this function is not zero anywhere, so the usual calculus approach is not going to work here. Because the likelihood is the product of the individual uniform densities, if any of the $x_i$ values were outside of the proposed $[0,\theta]$ interval, then the likelihood would go to zero, because the uniform density is zero outside of $[0,\theta]$. Naturally, this is no good for maximization. Thus, observing that the likelihood function is strictly decreasing with increasing $\theta$, we conclude that the value for $\theta$ that maximizes the likelihood is the smallest $\theta$ consistent with the data, which is the maximum of the $x_i$ values. To summarize, the maximum likelihood estimator is the following:
# $$
# \theta_{ML} = \max_i x_i
# $$
# As always, we want the distribution of this estimator to judge its performance. In this case, this is pretty straightforward. The cumulative distribution function for the $\max$ function is the following:
# $$
# \mathbb{P} \left( \hat{\theta}_{ML} < v \right) = \mathbb{P}( x_1 \leq v \wedge x_2 \leq v \wedge \ldots \wedge x_n \leq v)
# $$
# and since all the $x_i$ are independent and uniformly distributed in $[0,\theta]$, we have
# $$
# \mathbb{P} \left( \hat{\theta}_{ML} < v \right) = \left(\frac{v}{\theta}\right)^n
# $$
# So, the probability density function is then,
# $$
# f_{\hat{\theta}_{ML}}(\theta_{ML}) = n \theta_{ML}^{n-1} \theta^{-n}
# $$
# Then, we can compute $\mathbb{E}(\theta_{ML}) = \frac{n\theta}{n+1}$ with corresponding variance $\mathbb{V}(\theta_{ML}) = \frac{n\theta^2}{(n+1)^2(n+2)}$.
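# Before simulating, these two moments can be checked symbolically; the following is a quick sketch using Sympy's `integrate` against the density just derived.

# In[ ]:

import sympy
v, theta, n = sympy.symbols('v theta n', positive=True)
pdf = n*v**(n-1)/theta**n                    # density of the max derived above
E  = sympy.integrate(v*pdf, (v,0,theta))     # -> n*theta/(n+1)
E2 = sympy.integrate(v**2*pdf, (v,0,theta))  # second moment
print(E, sympy.simplify(E2 - E**2))          # variance -> n*theta**2/((n+1)**2*(n+2))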
# For a quick sanity check, we can write the following simulation for $\theta = 1$:

# In[12]:

from scipy import stats
rv = stats.uniform(0,1)         # define uniform random variable
mle = rv.rvs((100,500)).max(0)  # max along row-dimension
print(np.mean(mle))  # approx n/(n+1) = 100/101 ~= 0.990
print(np.var(mle))   # approx n/(n+1)**2/(n+2) ~= 9.61e-05

# **Programming Tip.**
#
# The `max(0)` suffix on the `mle` computation takes the maximum of the so-computed array along the column (`axis=0`) dimension.
#
# You can also plot `hist(mle)` to see the histogram of the simulated maximum likelihood estimates and match it up against the probability density function we derived above.
#
# In this section, we explored the concept of maximum likelihood estimation using a coin flipping experiment, both analytically and numerically, with the scientific Python stack. We also explored the case when calculus is not workable for maximum likelihood estimation. There are two key points to remember. First, maximum likelihood estimation produces a function of the data that is itself a random variable, with its own probability distribution. We can get at the quality of the so-derived estimators by examining the confidence intervals around the estimated values using the probability distributions associated with the estimators themselves. Second, maximum likelihood estimation applies even in situations where using basic calculus is not applicable [[wasserman2004all]](#wasserman2004all).
#
# ## Delta Method
#
# The Central Limit Theorem provides a way to get at the distribution of a random variable. However, sometimes we are more interested in a function of the random variable. In order to extend and generalize the central limit theorem in this way, we need the Taylor series expansion. Recall that the Taylor series expansion is an approximation of a function of the following form,
# $$
# T_r(x) = \sum_{i=0}^r \frac{g^{(i)}(a)}{i!}(x-a)^i
# $$
# This basically says that a function $g$ can be adequately approximated about a point $a$ using a polynomial based on its derivatives evaluated at $a$. Before we state the general theorem, let's examine an example to understand how the mechanics work.
#
# **Example.** Suppose that $X$ is a random variable with $\mathbb{E}(X)=\mu\neq 0$. Furthermore, suppose we have a suitable function $g$ and we want the distribution of $g(X)$. Applying the Taylor series expansion, we obtain the following,
# $$
# g(X) \approx g(\mu) + g^{\prime}(\mu)(X-\mu)
# $$
# If we use $g(X)$ as an estimator for $g(\mu)$, then we can say that we approximately have the following
# $$
# \begin{align*}
# \mathbb{E}(g(X)) &= g(\mu) \\
# \mathbb{V}(g(X)) &= (g^{\prime}(\mu))^2 \mathbb{V}(X) \\
# \end{align*}
# $$
# Concretely, suppose we want to estimate the odds, $\frac{p}{1-p}$. For example, if $p=2/3$, then we say that the odds are `2:1`, meaning that one outcome is twice as likely as the other. Thus, we have $g(p)=\frac{p}{1-p}$ and we want to find $\mathbb{V}(g(\hat{p}))$. In our coin-flipping problem, we have the estimator $\hat{p}=\frac{1}{n}\sum X_k$ from the Bernoulli-distributed data $X_k$ of individual coin flips. Thus,
# $$
# \begin{align*}
# \mathbb{E}(\hat{p}) &= p \\
# \mathbb{V}(\hat{p}) &= \frac{p(1-p)}{n} \\
# \end{align*}
# $$
# Now, $g^\prime(p) = 1/(1-p)^2$, so we have,
# $$
# \begin{align*}
# \mathbb{V}(g(\hat{p})) &= (g^\prime(p))^2 \mathbb{V}(\hat{p}) \\
# &= \left(\frac{1}{(1-p)^2}\right)^2 \frac{p(1-p)}{n} \\
# &= \frac{p}{n(1-p)^3} \\
# \end{align*}
# $$
# which is an approximation of the variance of the estimator $g(\hat{p})$. Let's simulate this and see how it agrees.

# In[13]:

from scipy import stats
# compute MLE estimates
d = stats.bernoulli(0.1).rvs((10,5000)).mean(0)
# avoid divide-by-zero
d = d[np.logical_not(np.isclose(d,1))]
# compute odds ratio
odds = d/(1-d)
print('odds ratio=',np.mean(odds),'var=',np.var(odds))

# The first number above is the mean of the simulated odds ratio and the second is the variance of the estimate. According to the variance estimate above, we have $\mathbb{V}(g(1/10)) \approx 0.0137$, which is not too bad for this approximation. Recall we want to estimate the odds from the $\hat{p}$. The code above takes 5000 estimates of the $\hat{p}$ to estimate $\mathbb{V}(g)$. The odds ratio for $p=1/10$ is $1/9 \approx 0.111$.
#
# **Programming Tip.**
#
# The code above uses the `np.isclose` function to identify the cases where $\hat{p}$ equals one, and `np.logical_not` removes these elements from the data because the odds ratio has a zero in the denominator for these values.
#
# Let's try this again with a probability of heads of `0.5` instead of `0.1`.

# In[14]:

from scipy import stats
d = stats.bernoulli(0.5).rvs((10,5000)).mean(0)
d = d[np.logical_not(np.isclose(d,1))]
odds = d/(1-d)  # compute the odds ratio, not just the mean of d
print('odds ratio=',np.mean(odds),'var=',np.var(odds))

# The true odds ratio in this case is equal to one, which is not close to what was reported.
# According to our approximation, we have $\mathbb{V}(g) = 0.4$, which does not look like what our simulation just reported. This is because the delta method approximation is best where the function $g$ is nearly linear, and worse otherwise; near $p=1/2$, the odds ratio is strongly nonlinear.
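# The following sketch makes this comparison explicit by checking the simulated variance of the odds ratio against the delta method formula $p/(n(1-p)^3)$ for a few values of $p$ (the particular values and the replication count are illustrative assumptions).

# In[ ]:

from scipy import stats
n = 10
for p in (0.1, 0.3, 0.5):
    d = stats.bernoulli(p).rvs((n,50000)).mean(0)
    d = d[np.logical_not(np.isclose(d,1))]      # avoid divide-by-zero
    print(p, np.var(d/(1-d)), p/(n*(1-p)**3))   # simulated vs. delta method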
# **Figure:** The odds ratio is close to linear for small values but becomes unbounded as $p$ approaches one. The delta method is more effective for small underlying values of $p$, where the linear approximation is better.
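# Since the plotting code for this figure is not shown, here is a minimal sketch that reproduces its content by plotting the odds ratio against its tangent line at $p=0$:

# In[ ]:

fig, ax = subplots()
pvals = np.linspace(0, 0.9, 100)
ax.plot(pvals, pvals/(1-pvals), 'k-', lw=3, label='$p/(1-p)$')
ax.plot(pvals, pvals, 'k--', lw=2, label='tangent at $p=0$')
ax.set_xlabel('$p$', fontsize=18)
ax.legend(loc=0)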