#!/usr/bin/env python
# coding: utf-8

# In[1]:


from IPython.display import Image
Image('../../Python_probability_statistics_machine_learning_2E.png',width=200)


# It is sometimes very difficult to unequivocally attribute outcomes to causal
# factors. For example, did your experiment generate the outcome you were
# hoping for or not? Maybe something did happen, but the effect is not
# pronounced enough to separate it from inescapable measurement errors or other
# factors in the ambient environment. Hypothesis testing is a powerful
# statistical method to address these questions. Let's begin by again
# considering our coin-tossing experiment with unknown parameter $p$. Recall
# that the individual coin-flips are Bernoulli distributed. The first step is
# to establish separate hypotheses. First, $H_0$ is the so-called null
# hypothesis. In our case this can be
#
# $$
# H_0 \colon \theta < \frac{1}{2}
# $$
#
# and the alternative hypothesis is then
#
# $$
# H_1 \colon \theta \geq \frac{1}{2}
# $$
#
# With this setup, the question now boils down to figuring out which
# hypothesis the data is most consistent with. To choose between these, we
# need a statistical test that is a function, $G$, mapping the sample set
# $\mathbf{X}_n=\left\{ X_i \right\}_n$ into the real line, where $X_i$ is the
# heads or tails outcome ($X_i \in \lbrace 0,1 \rbrace$). In other words, we
# compute $G(\mathbf{X}_n)$ and check if it exceeds a threshold $c$. If not,
# then we declare $H_0$ (otherwise, declare $H_1$). Notationally, this is the
# following:
#
# $$
# \begin{align*}
# G(\mathbf{X}_n) < c & \Rightarrow H_0 \\
# G(\mathbf{X}_n) \geq c & \Rightarrow H_1
# \end{align*}
# $$
#
# In summary, we have the observed data $\mathbf{X}_n$ and a function $G$ that
# maps that data onto the real line. Then, using the constant $c$ as a
# threshold, the inequality effectively divides the real line into two parts,
# one corresponding to each of the hypotheses.
#
# Whatever this test $G$ is, it will make mistakes of two types --- false
# negatives and false positives. The false positives arise when we declare
# $H_1$ even though $H_0$ is actually true. This is summarized in
# Table [1](#tbl:decision).
#
# $$
# \begin{table}
# \footnotesize
# \centering
# \begin{tabular}{l|p{1.3in}|p{1.3in}}
# \multicolumn{1}{c}{ } & \multicolumn{1}{c}{Declare $H_0$} & \multicolumn{1}{c}{Declare $H_1$} \\
# \hline
# $H_0\:$ True & Correct & False positive (Type I error) \\
# \hline
# $H_1\:$ True & False negative (Type II error) & Correct (true-detect) \\
# \hline
# \end{tabular}
# \caption{Truth table for hypotheses testing.}
# \label{tbl:decision} \tag{1}
# \end{table}
# $$
#
# For this example, here is the false positive (aka false alarm) probability:
#
# $$
# P_{FA} = \mathbb{P}\left( G(\mathbf{X}_n) \geq c \mid \theta < \frac{1}{2} \right)
# $$
#
# Or, equivalently,
#
# $$
# P_{FA} = \mathbb{P}\left( G(\mathbf{X}_n) \geq c \mid H_0 \right)
# $$
#
# Likewise, the other error is a false negative, which we can write
# analogously as
#
# $$
# P_{FN} = \mathbb{P}\left( G(\mathbf{X}_n) < c \mid H_1\right)
# $$
#
# By choosing some acceptable value for either of these errors, we can solve
# for the other one. The practice is usually to pick a value of $P_{FA}$ and
# then find the corresponding value of $P_{FN}$. Note that it is traditional
# in engineering to speak about *detection probability*, which is defined as
#
# $$
# P_{D} = 1 - P_{FN} = \mathbb{P}\left( G(\mathbf{X}_n) \geq c \mid H_1\right)
# $$
#
# In other words, this is the probability of declaring $H_1$ when the test
# exceeds the threshold. This is otherwise known as the *probability of a true
# detection* or *true-detect*.
#
# ## Back to the Coin Flipping Example
#
# In our previous maximum likelihood discussion, we wanted to derive an
# estimator for the *value* of the probability of heads for the coin flipping
# experiment. For hypothesis testing, we want to ask a softer question: is the
# probability of heads greater or less than $\nicefrac{1}{2}$? As we just
# established, this leads to the two hypotheses:
#
# $$
# H_0 \colon \theta < \frac{1}{2}
# $$
#
# versus,
#
# $$
# H_1 \colon \theta \geq \frac{1}{2}
# $$
#
# Let's assume we have five observations. Now we need the $G$ function and a
# threshold $c$ to help pick between the two hypotheses. Let's count the
# number of heads observed in five observations as our criterion. Thus, we
# have
#
# $$
# G(\mathbf{X}_5) := \sum_{i=1}^5 X_i
# $$
#
# and, suppose further that we pick $H_1$ only if exactly five out of five
# observations are heads. We'll call this the *all-heads* test.
#
# Now, because all of the $X_i$ are random variables, so is $G$ and we must
# find the corresponding probability mass function for $G$. Assuming the
# individual coin tosses are independent, the probability of five heads is
# $\theta^5$. This means that the probability of rejecting the $H_0$
# hypothesis (and choosing $H_1$, because there are only two choices here)
# based on the unknown underlying probability is $\theta^5$. In the parlance,
# this is known as the *power function* and is denoted by $\beta$, as in
#
# $$
# \beta(\theta) = \theta^5
# $$
#
# Let's get a quick plot of this in [Figure](#fig:Hypothesis_testing_001).

# In[2]:


get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots
import numpy as np
fig,ax=subplots()
fig.set_size_inches((6,3))
xi = np.linspace(0,1,50)
_=ax.plot(xi, (xi)**5,'-k',label='all heads')
_=ax.set_xlabel(r'$\theta$',fontsize=22)
_=ax.plot(0.5,(0.5)**5,'ko')
fig.tight_layout()
fig.savefig('fig-statistics/Hypothesis_Testing_001.png')


# <!-- figure: fig-statistics/Hypothesis_Testing_001.png -->
# <div id="fig:Hypothesis_testing_001"></div>
#
# Figure: Power function for the all-heads test. The dark circle indicates the
# value of the function at $\theta=1/2$, which equals the size $\alpha$.
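# To attach numbers to this curve, we can evaluate the power function at a few
# illustrative values of $\theta$. Note how small $\beta(\theta)$ remains even
# for $\theta$ well above $1/2$, which foreshadows the detection problem
# discussed next.

# In[ ]:


# evaluate the all-heads power function beta(theta) = theta**5
for th in (0.5, 0.8, 0.9):
    print (th, th**5)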
#
# Now, we have the following false alarm probability,
#
# $$
# P_{FA}(\theta) = \mathbb{P}( G(\mathbf{X}_5) = 5 \mid H_0) = \theta^5, \quad \theta \in \Theta_0
# $$
#
# Notice that this is a function of $\theta$, which means there are many false
# alarm probability values that correspond to this test. To be on the
# conservative side, we'll pick the supremum (i.e., least upper bound) of this
# function, which is known as the *size* of the test, traditionally denoted by
# $\alpha$,
#
# $$
# \alpha = \sup_{\theta \in \Theta_0} \beta(\theta)
# $$
#
# with domain $\Theta_0 = \lbrace \theta < 1/2 \rbrace$ which in our case is
#
# $$
# \alpha = \sup_{\theta < \frac{1}{2}} \theta^5 = \left(\frac{1}{2}\right)^5 = 0.03125
# $$
#
# Likewise, for the detection probability,
#
# $$
# P_{D}(\theta) = \mathbb{P}( G(\mathbf{X}_5) = 5 \mid H_1) = \theta^5, \quad \theta \geq \frac{1}{2}
# $$
#
# which is again a function of the parameter $\theta$. The problem with this
# test is that $P_{D}$ is pretty low for most of the domain of $\theta$. For
# instance, values in the nineties for $P_{D}$ only happen when
# $\theta > 0.98$. In other words, if the coin produces heads 98 times out of
# 100, then we can detect $H_1$ reliably. Ideally, we want a test that is zero
# for the domain corresponding to $H_0$ (i.e., $\Theta_0$) and equal to one
# otherwise. Unfortunately, even if we increase the length of the observed
# sequence, we cannot escape this effect with this test. You can try plotting
# $\theta^n$ for larger and larger values of $n$ to see this.
#
# ### Majority Vote Test
#
# Due to the problems with the detection probability in the all-heads test,
# maybe we can think of another test that will have the performance we want?
# Suppose we reject $H_0$ if the majority of the observations are heads. Then,
# using the same reasoning as above, we have
#
# $$
# \beta(\theta) = \sum_{k=3}^5 \binom{5}{k} \theta^k(1-\theta)^{5-k}
# $$
#
# [Figure](#fig:Hypothesis_testing_002) shows the power function for both the
# majority vote and the all-heads tests.

# In[3]:


fig,ax=subplots()
fig.set_size_inches((6,3))
from sympy.abc import theta,k # get some variable symbols
import sympy as S
xi = np.linspace(0,1,50)
expr=S.Sum(S.binomial(5,k)*theta**(k)*(1-theta)**(5-k),(k,3,5)).doit()
_=ax.plot(xi, (xi)**5,'-k',label='all heads')
_=ax.plot(xi, S.lambdify(theta,expr)(xi),'--k',label='majority vote')
_=ax.plot(0.5, (0.5)**5,'ko')
_=ax.plot(0.5, S.lambdify(theta,expr)(0.5),'ko')
_=ax.set_xlabel(r'$\theta$',fontsize=22)
_=ax.legend(loc=0)
fig.tight_layout()
fig.savefig('fig-statistics/Hypothesis_Testing_002.png')


# <!-- figure: fig-statistics/Hypothesis_Testing_002.png -->
# <div id="fig:Hypothesis_testing_002"></div>
#
# Figure: Compares the power function for the all-heads test with that of the
# majority-vote test.
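# Before reading off the size of this new test analytically, we can evaluate
# the symbolic power function (`expr` from the previous cell) at the boundary
# $\theta=1/2$, where the supremum over $\Theta_0$ is approached,

# In[ ]:


# majority-vote power function at theta = 1/2
print (expr.subs(theta, S.Rational(1,2)))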
#
# In this case, the new test has *size*
#
# $$
# \alpha = \sup_{\theta < \frac{1}{2}} \theta^{5} + 5 \theta^{4} \left(- \theta + 1\right) + 10 \theta^{3} \left(- \theta + 1\right)^{2} = \frac{1}{2}
# $$
#
# As before, we get to upwards of 90% for the detection probability only when
# the underlying parameter $\theta > 0.75$. Let's see what happens when we
# consider more than five samples. For example, let's suppose that we have
# $n=100$ samples and we want to vary the threshold for the majority vote
# test. For example, let's have a new test where we declare $H_1$ when more
# than 60 out of the 100 trials turn out to be heads. What is the $\beta$
# function in this case?
#
# $$
# \beta(\theta) = \sum_{k=61}^{100} \binom{100}{k} \theta^k(1-\theta)^{100-k}
# $$
#
# This is too complicated to write by hand, but the statistics module in Sympy
# has all the tools we need to compute this.

# In[4]:


from sympy.stats import P, Binomial
theta = S.symbols('theta',real=True)
X = Binomial('x',100,theta)
beta_function = P(X>60)
print (beta_function.subs(theta,0.5)) # alpha
print (beta_function.subs(theta,0.70))


# These results are much better than before because the $\beta$ function is
# much steeper. If we declare $H_1$ when we observe more than 60 heads out of
# 100 trials, then we wrongly declare $H_1$ approximately 1.8% of the time.
# Otherwise, if it happens that the true value for $p>0.7$, we will conclude
# correctly approximately 97% of the time. A quick simulation can sanity check
# these results as shown below:

# In[5]:


from scipy import stats
rv=stats.bernoulli(0.5) # true p = 0.5
# number of false alarms ~ 0.018
print (sum(rv.rvs((1000,100)).sum(axis=1)>60)/1000.)


# The above code is pretty dense so let's unpack it. In the first line, we use
# the `scipy.stats` module to define the Bernoulli random variable for the
# coin flip. Then, we use the `rvs` method of the variable to generate 1000
# trials of the experiment where each trial consists of 100 coin flips. This
# generates a $1000 \times 100$ matrix where the rows are the individual
# trials and the columns are the outcomes of each respective set of 100 coin
# flips. The `sum(axis=1)` part computes the sum across the columns. Because
# the values of the matrix are only `1` or `0`, this gives us the count of
# flips that are heads per row. The next `>60` part computes the boolean
# 1000-long vector of values that are bigger than 60. The final `sum` adds
# these up. Again, because the entries in the array are `True` or `False`, the
# `sum` computes the count of times the number of heads has exceeded 60 per
# 100 coin flips in each of the 1000 trials. Then, dividing this number by
# 1000 gives a quick approximation of the false alarm probability we computed
# above for this case where the true value of $p=0.5$.
#
# ## Receiver Operating Characteristic
#
# Because the majority vote test is a binary test, we can compute the
# *Receiver Operating Characteristic* (ROC), which is the graph of the
# $(P_{FA}, P_D)$ pairs. The term comes from radar systems but is a very
# general method for consolidating all of these issues into a single graph.
# Let's consider a typical signal processing example with two hypotheses. In
# $H_0$, there is noise but no signal present at the receiver,
#
# $$
# H_0 \colon X = \epsilon
# $$
#
# where $\epsilon \sim \mathcal{N}(0,\sigma^2)$ represents additive noise.
# In the alternative hypothesis, there is a deterministic signal at the
# receiver,
#
# $$
# H_1 \colon X = \mu + \epsilon
# $$
#
# Again, the problem is to choose between these two hypotheses. For $H_0$, we
# have $X \sim \mathcal{N}(0,\sigma^2)$ and for $H_1$, we have $X \sim
# \mathcal{N}(\mu,\sigma^2)$. Recall that we only observe values for $x$ and
# must pick either $H_0$ or $H_1$ from these observations. Thus, we need a
# threshold, $c$, to compare $x$ against in order to distinguish the two
# hypotheses. [Figure](#fig:Hypothesis_testing_003) shows the probability
# density functions under each of the hypotheses. The dark vertical line is
# the threshold $c$. The gray shaded area is the probability of detection,
# $P_D$, and the dark gray shaded area is the probability of false alarm,
# $P_{FA}$. The test evaluates every observation of $x$ and concludes $H_0$ if
# $x < c$ and $H_1$ otherwise.
#
# <!-- figure: fig-statistics/Hypothesis_Testing_003.png -->
# <div id="fig:Hypothesis_testing_003"></div>
#
# Figure: The two density functions for the $H_0$ and $H_1$ hypotheses. The
# shaded gray area is the detection probability and the shaded dark gray area
# is the probability of false alarm. The vertical line is the decision
# threshold.
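# The figure above can be generated along the following lines. This is a
# minimal sketch, assuming $\mu=2$, $\sigma=1$, and threshold $c=1.5$ purely
# for illustration (the exact values behind the book figure are not shown in
# this excerpt),

# In[ ]:


from scipy import stats
fig,ax=subplots()
fig.set_size_inches((6,3))
mu, sigma, c = 2, 1, 1.5 # assumed signal level and threshold
xs = np.linspace(-4,6,200)
f0 = stats.norm(0,sigma).pdf(xs)  # density under H_0
f1 = stats.norm(mu,sigma).pdf(xs) # density under H_1
_=ax.plot(xs,f0,'-k')
_=ax.plot(xs,f1,'--k')
_=ax.axvline(c,color='k',lw=2)
# P_D: area under f1 to the right of the threshold
_=ax.fill_between(xs,f1,where=xs>c,color='gray',alpha=0.3)
# P_FA: area under f0 to the right of the threshold
_=ax.fill_between(xs,f0,where=xs>c,color='gray',alpha=0.8)
_=ax.set_xlabel(r'$x$',fontsize=22)
fig.tight_layout()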
# **Programming Tip.**
#
# The shading shown in [Figure](#fig:Hypothesis_testing_003) comes from
# Matplotlib's `fill_between` function. This function has a `where` keyword
# argument to specify which part of the plot to apply shading, with the shade
# set by the `color` keyword argument. Note there is also a `fill_betweenx`
# function that fills horizontally. The `text` function can place formatted
# text anywhere in the plot and can utilize basic \LaTeX{} formatting.
#
#
#
# As we slide the threshold left and right along the horizontal axis, we
# naturally change the corresponding areas under each of the curves shown in
# [Figure](#fig:Hypothesis_testing_003) and thereby change the values of $P_D$
# and $P_{FA}$. The contour that emerges from sweeping the threshold this way
# is the ROC as shown in [Figure](#fig:Hypothesis_testing_004). This figure
# also shows the diagonal line which corresponds to making decisions based on
# the flip of a fair coin. Any meaningful test must do better than coin
# flipping so the more the ROC bows up to the top left corner of the graph,
# the better. Sometimes ROCs are quantified into a single number called the
# *area under the curve* (AUC), which varies from 0.5 to 1.0 as shown. In our
# example, what separates the two probability density functions is the value
# of $\mu$. In a real situation, this would be determined by signal processing
# methods that include many complicated trade-offs. The key idea is that
# whatever those trade-offs are, the test itself boils down to the separation
# between these two density functions --- good tests separate the two density
# functions and bad tests do not. Indeed, when there is no separation, we
# arrive at the diagonal-line coin-flipping situation we just discussed.
#
# What values for $P_D$ and $P_{FA}$ are considered *acceptable* depends on
# the application. For example, suppose you are testing for a fatal disease.
# It could be that you are willing to accept a relatively high $P_{FA}$ value
# if that corresponds to a good $P_D$ because the test is relatively cheap to
# administer compared to the alternative of missing a detection. On the other
# hand, maybe a false alarm triggers an expensive response, so that minimizing
# these alarms is more important than potentially missing a detection. These
# trade-offs can only be determined by the application and design factors.
#
# <!-- figure: fig-statistics/Hypothesis_Testing_004.png -->
# <div id="fig:Hypothesis_testing_004"></div>
#
# Figure: The Receiver Operating Characteristic (ROC) corresponding to
# [Figure](#fig:Hypothesis_testing_003).
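# The ROC itself can be traced out by sweeping the threshold and computing the
# two tail probabilities at each step. This is a minimal sketch, again
# assuming $\mu=2$ and $\sigma=1$ for illustration,

# In[ ]:


from scipy import stats
fig,ax=subplots()
fig.set_size_inches((4,4))
mu, sigma = 2, 1 # assumed separation between the densities
cs = np.linspace(-4,6,100) # sweep of thresholds
pfa = stats.norm(0,sigma).sf(cs)  # P(X > c | H_0)
pd = stats.norm(mu,sigma).sf(cs)  # P(X > c | H_1)
_=ax.plot(pfa,pd,'-k',label='ROC')
_=ax.plot([0,1],[0,1],'--k',label='coin flip')
_=ax.set_xlabel(r'$P_{FA}$',fontsize=18)
_=ax.set_ylabel(r'$P_D$',fontsize=18)
_=ax.legend(loc=0)
fig.tight_layout()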
# ## P-Values
#
# There are a lot of moving parts in hypothesis testing. What we need is a way
# to consolidate the findings. The idea is that we want to find the minimum
# level at which the test rejects $H_0$. Thus, the p-value is the probability,
# under $H_0$, that the test-statistic is at least as extreme as what was
# actually observed. Informally, this means that smaller values imply that
# $H_0$ should be rejected, although this doesn't mean that large values imply
# that $H_0$ should be retained. This is because a large p-value can arise
# from either $H_0$ being true or the test having low statistical power.
#
# If $H_0$ is true, the p-value is uniformly distributed in the interval
# $(0,1)$. If $H_1$ is true, the distribution of the p-value will concentrate
# closer to zero. For continuous distributions, this can be proven rigorously
# and implies that if we reject $H_0$ when the corresponding p-value is less
# than $\alpha$, then the probability of a false alarm is $\alpha$. Perhaps it
# helps to formalize this a bit before computing it. Suppose $\tau(X)$ is a
# test statistic that rejects $H_0$ as it gets bigger. Then, for each sample
# $x$, corresponding to the data we actually have on-hand, we define
#
# $$
# p(x) = \sup_{\theta \in \Theta_0} \mathbb{P}_{\theta}(\tau(X) > \tau(x))
# $$
#
# This equation states that the supremum (i.e., least upper bound) of the
# probability that the test statistic, $\tau(X)$, exceeds the value of the
# test statistic on this particular data ($\tau(x)$), taken over the domain
# $\Theta_0$, is defined as the p-value. Thus, this embodies a worst-case
# scenario over all values of $\theta$.
#
# Here's one way to think about this. Suppose you rejected $H_0$, and someone
# says that you just got *lucky* and somehow just drew data that happened to
# correspond to a rejection of $H_0$. What p-values provide is a way to
# address this by capturing the odds of just such a favorable data-draw. Thus,
# suppose that your p-value is 0.05. Then, what you are showing is that the
# odds of just drawing that data sample, given $H_0$ is in force, is just 5%.
# This means that there's a 5% chance that you somehow lucked out and got a
# favorable draw of data.
#
# Let's make this concrete with an example. Given the majority-vote rule
# above, suppose we actually do observe three of five heads. Given $H_0$, the
# probability of observing this event is the following:
#
# $$
# p(x) =\sup_{\theta \in \Theta_0} \sum_{k=3}^5\binom{5}{k} \theta^k(1-\theta)^{5-k} = \frac{1}{2}
# $$
#
# For the all-heads test, the corresponding computation is the following:
#
# $$
# p(x) =\sup_{\theta \in \Theta_0} \theta^5 = \frac{1}{2^5} = 0.03125
# $$
#
# From just looking at these p-values, you might get the feeling that the
# second test is better, but we still have the same detection probability
# issues we discussed above; so, p-values help in summarizing some aspects of
# our hypothesis testing, but they do *not* summarize all the salient aspects
# of the *entire* situation.
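# We can verify both of these p-values numerically by evaluating each
# supremum at the boundary $\theta \rightarrow 1/2$, as a quick check,

# In[ ]:


from scipy.stats import binom
# majority-vote test: P(K >= 3) for K ~ binomial(5, 1/2)
print (1-binom(5,0.5).cdf(2))  # 0.5
# all-heads test: P(K = 5) at theta = 1/2
print (0.5**5)                 # 0.03125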
# ## Test Statistics
#
# As we have seen, it is difficult to derive good test statistics for
# hypothesis testing without a systematic process. The Neyman-Pearson Test is
# derived by fixing a false-alarm value ($\alpha$) and then maximizing the
# detection probability. This results in the Neyman-Pearson Test,
#
# $$
# L(\mathbf{x}) = \frac{f_{X|H_1}(\mathbf{x})}{f_{X|H_0}(\mathbf{x})} \underset{H_0}{\overset{H_1}{\gtrless}} \gamma
# $$
#
# where $L$ is the likelihood ratio and where the threshold $\gamma$ is chosen
# such that
#
# $$
# \int_{x:L(\mathbf{x})>\gamma} f_{X|H_0}(\mathbf{x}) d\mathbf{x}=\alpha
# $$
#
# The Neyman-Pearson Test is one of a family of tests that use the likelihood
# ratio.
#
# **Example.** Suppose we have a receiver and we want to distinguish whether
# just noise ($H_0$) or signal plus noise ($H_1$) is received. For the
# noise-only case, we have $x\sim \mathcal{N}(0,1)$ and for the signal plus
# noise case we have $x\sim \mathcal{N}(1,1)$. In other words, the mean of the
# distribution shifts in the presence of the signal. This is a very common
# problem in signal processing and communications. The Neyman-Pearson Test
# then boils down to the following,
#
# $$
# L(x)= e^{-\frac{1}{2}+x}\underset{H_0}{\overset{H_1}{\gtrless}}\gamma
# $$
#
# Now we have to find the threshold $\gamma$ that solves the maximization
# problem that characterizes the Neyman-Pearson Test. Taking the natural
# logarithm and re-arranging gives,
#
# $$
# x\underset{H_0}{\overset{H_1}{\gtrless}} \frac{1}{2}+\log\gamma
# $$
#
# The next step is to find the $\gamma$ corresponding to the desired $\alpha$
# by computing it from the following,
#
# $$
# \int_{1/2+\log\gamma}^{\infty} f_{X|H_0}(x)dx = \alpha
# $$
#
# For example, taking $\alpha=1/100$ gives $\gamma\approx 6.21$. To summarize
# the test in this case, we have,
#
# $$
# x\underset{H_0}{\overset{H_1}{\gtrless}} 2.32
# $$
#
# Thus, if we measure $X$ and see that its value exceeds the threshold above,
# we declare $H_1$ and otherwise declare $H_0$. The following code shows how
# to solve this example using Sympy and Scipy. First, we set up the likelihood
# ratio,

# In[6]:


import sympy as S
from sympy import stats
s = stats.Normal('s',1,1) # signal+noise
n = stats.Normal('n',0,1) # noise
x = S.symbols('x',real=True)
L = stats.density(s)(x)/stats.density(n)(x)


# Next, to find the $\gamma$ value,

# In[7]:


g = S.symbols('g',positive=True) # define gamma
v=S.integrate(stats.density(n)(x),
              (x,S.Rational(1,2)+S.log(g),S.oo))


# **Programming Tip.**
#
# Providing additional information regarding the Sympy variable by using the
# keyword argument `positive=True` helps the internal simplification
# algorithms work faster and better. This is especially useful when dealing
# with complicated integrals that involve special functions. Furthermore, note
# that we used the `Rational` function to define the `1/2` fraction, which is
# another way of providing hints to Sympy. Otherwise, it's possible that the
# floating-point representation of the fraction could disguise the simple
# fraction and thereby miss internal simplification opportunities.
#
#
#
# We want to solve for `g` in the above expression. Sympy has some built-in
# numerical solvers as in the following,

# In[8]:


print (S.nsolve(v-0.01,3.0)) # approx 6.21


# Note that in this situation it is better to use the numerical solvers
# because Sympy `solve` may grind along for a long time to resolve this.
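# We can cross-check this result with Scipy. The threshold on $x$ is the
# upper 1% quantile of the standard normal, and $\gamma$ follows from
# inverting $c = 1/2+\log\gamma$,

# In[ ]:


from scipy import stats
c = stats.norm.isf(0.01) # upper 1% quantile, approx 2.326
print (c)
print (np.exp(c-0.5))    # corresponding gamma, approx 6.21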
# ### Generalized Likelihood Ratio Test
#
# The likelihood ratio test can be generalized using the following statistic,
#
# $$
# \Lambda(\mathbf{x})= \frac{\sup_{\theta\in\Theta_0} L(\theta)}{\sup_{\theta\in\Theta} L(\theta)}=\frac{L(\hat{\theta}_0)}{L(\hat{\theta})}
# $$
#
# where $\hat{\theta}_0$ maximizes $L(\theta)$ subject to $\theta\in\Theta_0$
# and $\hat{\theta}$ is the maximum likelihood estimator. The intuition behind
# this generalization of the likelihood ratio test is that the denominator is
# the likelihood maximized over the entire parameter space and the numerator
# is the likelihood maximized over the restricted domain ($\Theta_0$). This
# means that the ratio is always less than unity because the maximized
# likelihood over the entire space is always at least as large as that over
# the more restricted space. When this $\Lambda$ ratio gets small enough, it
# means that the likelihood maximized over the entire domain ($\Theta$) is
# much larger, which means that it is safe to reject the null hypothesis
# $H_0$. The tricky part is that the statistical distribution of $\Lambda$ is
# usually eye-wateringly difficult. Fortunately, Wilks Theorem says that with
# sufficiently large $n$, the distribution of $-2\log\Lambda$ is approximately
# chi-square with $r-r_0$ degrees of freedom, where $r$ is the number of free
# parameters for $\Theta$ and $r_0$ is the number of free parameters in
# $\Theta_0$. With this result, if we want an approximate test at level
# $\alpha$, we can reject $H_0$ when $-2\log\Lambda \ge \chi^2_{r-r_0}(\alpha)$
# where $\chi^2_{r-r_0}(\alpha)$ denotes the $1-\alpha$ quantile of the
# $\chi^2_{r-r_0}$ chi-square distribution. However, the problem with this
# result is that there is no definite way of knowing how big $n$ should be.
# The advantage of this generalized likelihood ratio test is that it can test
# multiple hypotheses simultaneously, as illustrated in the following example.
#
# **Example.** Let's return to our coin-flipping example, except now we have
# three different coins. The likelihood function is then,
#
# $$
# L(p_1,p_2,p_3) = \texttt{binom}(k_1;n_1,p_1)\texttt{binom}(k_2;n_2,p_2)\texttt{binom}(k_3;n_3,p_3)
# $$
#
# where $\texttt{binom}$ is the binomial probability mass function with the
# given parameters,
#
# $$
# \texttt{binom}(k;n,p) = \binom{n}{k} p^k(1-p)^{n-k}
# $$
#
# The null hypothesis is that all three coins have the same probability of
# heads, $H_0:p=p_1=p_2=p_3$. The alternative hypothesis is that at least one
# of these probabilities is different. Let's consider the numerator of
# $\Lambda$ first, which will give us the maximum likelihood estimator of $p$.
# Because the null hypothesis is that all the $p$ values are equal, we can
# just treat this as one big binomial distribution with $n=n_1+n_2+n_3$, where
# $k=k_1+k_2+k_3$ is the total number of heads observed for any coin. Thus,
# under the null hypothesis, the distribution of $k$ is binomial with
# parameters $n$ and $p$. Now, what is the maximum likelihood estimator for
# this distribution? We have worked this problem before and have the
# following,
#
# $$
# \hat{p}_0= \frac{k}{n}
# $$
#
# In other words, the maximum likelihood estimator under the null hypothesis
# is the proportion of ones observed in the sequence of $n$ trials total.
# Now, we have to substitute this in for the likelihood under the null
# hypothesis to finish the numerator of $\Lambda$,
#
# $$
# L(\hat{p}_0,\hat{p}_0,\hat{p}_0) = \texttt{binom}(k_1;n_1,\hat{p}_0)\texttt{binom}(k_2;n_2,\hat{p}_0)\texttt{binom}(k_3;n_3,\hat{p}_0)
# $$
#
# For the denominator of $\Lambda$, which represents the case of maximizing
# over the entire space, the maximum likelihood estimator for each separate
# binomial distribution is likewise,
#
# $$
# \hat{p}_i= \frac{k_i}{n_i}
# $$
#
# which makes the likelihood in the denominator the following,
#
# $$
# L(\hat{p}_1,\hat{p}_2,\hat{p}_3) = \texttt{binom}(k_1;n_1,\hat{p}_1)\texttt{binom}(k_2;n_2,\hat{p}_2)\texttt{binom}(k_3;n_3,\hat{p}_3)
# $$
#
# for each of the $i\in \lbrace 1,2,3 \rbrace$ binomial distributions. The
# $\Lambda$ statistic is then the following,
#
# $$
# \Lambda(k_1,k_2,k_3) = \frac{L(\hat{p}_0,\hat{p}_0,\hat{p}_0)}{L(\hat{p}_1,\hat{p}_2,\hat{p}_3)}
# $$
#
# Wilks theorem states that $-2\log\Lambda$ is approximately chi-square
# distributed. We can compute this example with the statistics tools in Sympy
# and Scipy.

# In[9]:


from scipy.stats import binom, chi2
import numpy as np
# some sample parameters
p0,p1,p2 = 0.3,0.4,0.5
n0,n1,n2 = 50,180,200
brvs= [ binom(i,j) for i,j in zip((n0,n1,n2),(p0,p1,p2))]
def gen_sample(n=1):
    'generate samples from separate binomial distributions'
    if n==1:
        return [i.rvs() for i in brvs]
    else:
        return [gen_sample() for k in range(n)]


# **Programming Tip.**
#
# Note the recursion in the definition of the `gen_sample` function where a
# conditional clause of the function calls itself. This is a quick way of
# reusing code and generating vectorized output. Using `np.vectorize` is
# another way, but the code is simple enough in this case to use the
# conditional clause. In Python, it is generally bad for performance to have
# code with nested recursion because of how the stack frames are managed.
# However, here we are only recursing once so this is not an issue.
#
#
#
# Next, we compute the logarithm of the numerator of the $\Lambda$ statistic,

# In[10]:


np.random.seed(1234)


# In[11]:


k0,k1,k2 = gen_sample()
print (k0,k1,k2)
pH0 = sum((k0,k1,k2))/sum((n0,n1,n2))
numer = np.sum([np.log(binom(ni,pH0).pmf(ki))
                  for ni,ki in zip((n0,n1,n2),(k0,k1,k2))])
print (numer)


# Note that we used the null hypothesis estimate $\hat{p}_0$. Likewise, for
# the logarithm of the denominator, where each coin uses its own maximum
# likelihood estimate $\hat{p}_i = k_i/n_i$, we have the following,

# In[12]:


denom = np.sum([np.log(binom(ni,ki/ni).pmf(ki)) # MLE for each coin
                  for ni,ki in zip((n0,n1,n2),(k0,k1,k2))])
print (denom)


# Now, we can compute the logarithm of the $\Lambda$ statistic as follows and
# see what the corresponding value is according to Wilks theorem,

# In[13]:


chsq=chi2(2)
logLambda =-2*(numer-denom)
print (logLambda)
print (1- chsq.cdf(logLambda))


# Because the p-value reported above is less than the 5% significance level,
# we reject the null hypothesis that all the coins have the same probability
# of heads. Note that there are two degrees of freedom because the difference
# in the number of parameters between the null hypothesis ($p$) and the
# alternative ($p_1,p_2,p_3$) is two.
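# As a quick check of Wilks theorem in this setting, we can simulate under
# the null hypothesis (all three coins identical; $p=0.4$ is an arbitrary
# common value chosen for this check) and compare the sample mean of
# $-2\log\Lambda$ against the mean of the $\chi^2_2$ distribution, which is
# two,

# In[ ]:


p = 0.4 # assumed common probability of heads under H0
sims = []
for _ in range(500):
    ks = [binom(ni,p).rvs() for ni in (n0,n1,n2)]
    pH0 = sum(ks)/sum((n0,n1,n2))
    numer = np.sum([np.log(binom(ni,pH0).pmf(ki))
                      for ni,ki in zip((n0,n1,n2),ks)])
    denom = np.sum([np.log(binom(ni,ki/ni).pmf(ki))
                      for ni,ki in zip((n0,n1,n2),ks)])
    sims.append(-2*(numer-denom))

print (np.mean(sims)) # should be close to 2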
# We can build a quick Monte Carlo simulation to check the probability of
# detection for this example using the following code, which is just a
# combination of the last few code blocks,

# In[14]:


c= chsq.isf(.05) # 5% significance level
out = []
for k0,k1,k2 in gen_sample(100):
    pH0 = sum((k0,k1,k2))/sum((n0,n1,n2))
    numer = np.sum([np.log(binom(ni,pH0).pmf(ki))
                      for ni,ki in zip((n0,n1,n2),(k0,k1,k2))])
    denom = np.sum([np.log(binom(ni,ki/ni).pmf(ki)) # MLE for each coin
                      for ni,ki in zip((n0,n1,n2),(k0,k1,k2))])
    out.append(-2*(numer-denom)>c)

print (np.mean(out)) # estimated probability of detection


# The above simulation shows the estimated probability of detection for this
# set of example parameters. This relatively low probability of detection
# means that while the test is unlikely (i.e., at the 5% significance level)
# to mistakenly pick the null hypothesis, it is likewise missing many of the
# $H_1$ cases (i.e., low probability of detection). The trade-off between
# which is more important is up to the particular context of the problem. In
# some situations, we may prefer additional false alarms in exchange for
# missing fewer $H_1$ cases.
#
# ### Permutation Test
#
# The Permutation Test is a good way to test whether or not samples come from
# the same distribution. For example, suppose that
#
# $$
# X_1, X_2, \ldots, X_m \sim F
# $$
#
# and also,
#
# $$
# Y_1, Y_2, \ldots, Y_n \sim G
# $$
#
# That is, $Y_i$ and $X_i$ come from different distributions. Suppose we have
# some test statistic, for example
#
# $$
# T(X_1,\ldots,X_m,Y_1,\ldots,Y_n) = \vert\overline{X}-\overline{Y}\vert
# $$
#
# Under the null hypothesis for which $F=G$, any of the $(n+m)!$ permutations
# are equally likely. Thus, suppose for each of the $(n+m)!$ permutations, we
# have the computed statistic,
#
# $$
# \lbrace T_1,T_2,\ldots,T_{(n+m)!} \rbrace
# $$
#
# Then, under the null hypothesis, each of these values is equally likely. The
# distribution of $T$ under the null hypothesis is the *permutation
# distribution* that puts weight $1/(n+m)!$ on each $T$-value. Suppose $t_o$
# is the observed value of the test statistic and assume that large $T$
# rejects the null hypothesis, then the p-value for the permutation test is
# the following,
#
# $$
# P(T>t_o)= \frac{1}{(n+m)!} \sum_{j=1}^{(n+m)!} I(T_j>t_o)
# $$
#
# where $I()$ is the indicator function. For large $(n+m)!$, we can sample
# randomly from the set of all permutations to estimate this p-value.
#
# **Example.** Let's return to our coin-flipping example from last time, but
# now we have only two coins. The hypothesis is that both coins have the same
# probability of heads. We can use the built-in function in Numpy to compute
# the random permutations.

# In[15]:


x=binom(10,0.3).rvs(5) # p=0.3
y=binom(10,0.5).rvs(3) # p=0.5
z = np.hstack([x,y]) # combine into one array
t_o = abs(x.mean()-y.mean())
out = [] # output container
for k in range(1000):
    perm = np.random.permutation(z)
    T=abs(perm[:len(x)].mean()-perm[len(x):].mean())
    out.append((T>t_o))

print ('p-value = ', np.mean(out))


# Note that the size of the total permutation space is $8!=40320$ so we are
# taking relatively few (i.e., 1000) random permutations from this space.
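# Because the permutation space here is so small, we can also enumerate all
# $8!$ orderings exactly instead of sampling them. The following sketch reuses
# `z`, `x`, and `t_o` from the previous cell,

# In[ ]:


from itertools import permutations
from math import factorial
count = 0
for perm in permutations(z): # all 40320 orderings
    T = abs(np.mean(perm[:len(x)])-np.mean(perm[len(x):]))
    count += (T>t_o)

print ('exact p-value = ', count/factorial(len(z)))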
# ### Wald Test
#
# The Wald Test is an asymptotic test. Suppose we have $H_0:\theta=\theta_0$
# and otherwise $H_1:\theta\ne\theta_0$. The corresponding statistic is
# defined as the following,
#
# $$
# W=\frac{\hat{\theta}_n-\theta_0}{\texttt{se}}
# $$
#
# where $\hat{\theta}_n$ is the maximum likelihood estimator and
# $\texttt{se}$ is the standard error,
#
# $$
# \texttt{se} = \sqrt{\mathbb{V}(\hat{\theta}_n)}
# $$
#
# Under general conditions, $W\overset{d}{\to} \mathcal{N}(0,1)$. Thus, an
# asymptotic test at level $\alpha$ rejects when $\vert W\vert> z_{\alpha/2}$
# where $z_{\alpha/2}$ corresponds to $\mathbb{P}(\vert Z\vert
# >z_{\alpha/2})=\alpha$ with $Z \sim \mathcal{N}(0,1)$. For our favorite
# coin-flipping example, if $H_0:\theta=\theta_0$, then
#
# $$
# W = \frac{\hat{\theta}-\theta_0}{\sqrt{\hat{\theta}(1-\hat{\theta})/n}}
# $$
#
# We can simulate this using the following code at the usual 5% significance
# level,

# In[16]:


from scipy import stats
theta0 = 0.5 # H0
k=np.random.binomial(1000,0.3)
theta_hat = k/1000. # MLE
W = (theta_hat-theta0)/np.sqrt(theta_hat*(1-theta_hat)/1000)
c = stats.norm().isf(0.05/2) # z_{alpha/2}
print (abs(W)>c) # if true, reject H0


# This rejects $H_0$ because the true $\theta=0.3$ and the null hypothesis is
# that $\theta=0.5$. Note that $n=1000$ in this case, which puts us well
# inside the asymptotic range of the result. We can re-do this example to
# estimate the detection probability as in the following code,

# In[17]:


theta0 = 0.5 # H0
c = stats.norm().isf(0.05/2.) # z_{alpha/2}
out = []
for i in range(100):
    k=np.random.binomial(1000,0.3)
    theta_hat = k/1000. # MLE
    W = (theta_hat-theta0)/np.sqrt(theta_hat*(1-theta_hat)/1000.)
    out.append(abs(W)>c) # if true, reject H0

print (np.mean(out)) # detection probability


# ## Testing Multiple Hypotheses
#
# Thus far, we have focused primarily on two competing hypotheses. Now, we
# consider multiple comparisons. The general situation is the following. We
# test the null hypothesis against a sequence of $n$ competing hypotheses
# $H_k$. We obtain p-values for each hypothesis so now we have multiple
# p-values to consider $\lbrace p_k \rbrace$. To boil this sequence down to a
# single criterion, we can make the following argument. Given $n$ independent
# tests in which the null hypothesis is actually true for each, the
# probability of getting at least one false alarm is the following,
#
# $$
# P_{FA} = 1-(1-p_0)^n
# $$
#
# where $p_0$ is the individual p-value threshold (say, 0.05). The problem
# here is that $P_{FA}\rightarrow 1$ as $n\rightarrow\infty$. If we want to
# make many comparisons at once and control the overall false alarm rate, the
# overall p-value should be computed under the assumption that none of the
# competing hypotheses is valid. The most common way to address this is with
# the Bonferroni correction, which says that the individual significance
# level should be reduced to $p/n$. Obviously, this makes it much harder to
# declare significance for any particular hypothesis. The natural consequence
# of this conservative restriction is to reduce the statistical power of the
# experiment, thus making it more likely that true effects will be missed.
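# To see how quickly this family-wise false alarm rate grows, and how the
# Bonferroni correction tames it, we can evaluate the expression above for an
# assumed batch of twenty independent tests,

# In[ ]:


ntests = 20 # assumed number of simultaneous tests
p_0 = 0.05  # individual p-value threshold
# probability of at least one false alarm without correction
print (1-(1-p_0)**ntests)        # approx 0.64
# with the Bonferroni correction, each test uses p_0/ntests
print (1-(1-p_0/ntests)**ntests) # approx 0.049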
# In 1995, Benjamini and Hochberg devised a simple method that tells which
# p-values are statistically significant. The procedure is to sort the list
# of p-values in ascending order, choose a false-discovery rate (say, $q$),
# and then find the largest p-value in the sorted list such that $p_k \le
# k q/n$, where $k$ is the p-value's position in the sorted list. Finally,
# declare that $p_k$ value and all the others less than it statistically
# significant. This procedure guarantees that the proportion of
# false-positives is less than $q$ (on average). The Benjamini-Hochberg
# procedure (and its derivatives) is fast and effective and is widely used
# for testing hundreds of hypotheses, most of which are expected to be false,
# when studying genetics or diseases. Additionally, this procedure provides
# better statistical power than the Bonferroni correction.
#
# ## Fisher Exact Test
#
# $$
# \begin{table}
# \centering
# \caption{Example Contingency Table}
# \label{tab:contingencyTable} \tag{2}
# \begin{tabular}{l|c|c|c}
#  & Infection & No infection & Total \\
# \hline
# Male & 13 & 11 & 24 \\
# \hline
# Female & 12 & 1 & 13 \\
# \hline
# Total & 25 & 12 & 37 \\
# \hline
# \end{tabular}
# \end{table}
# $$
#
# Contingency tables represent the partitioning of a sample population of two
# categories between two different classifications as shown in
# Table [2](#tab:contingencyTable). The question is whether or not the
# observed table corresponds to a random partition of the sample population,
# constrained by the marginal sums. Note that because this is a two-by-two
# table, a change in any of the table entries automatically affects all of
# the other terms because of the row and column sum constraints. This means
# that equivalent questions like "Under a random partition, what is the
# probability that a particular table entry is at least as large as a given
# value?" can be meaningfully posed.
#
# The Fisher Exact Test addresses this question. The idea is to compute the
# probability of a particular entry of the table, conditioned upon the
# marginal row and column sums,
#
# $$
# \mathbb{P}(X_{i,j}\vert r_1,r_2,c_1,c_2)
# $$
#
# where $X_{i,j}$ is the $(i,j)$ table entry, $r_1$ represents the sum of the
# first row, $r_2$ represents the sum of the second row, $c_1$ represents the
# sum of the first column, and $c_2$ is the sum of the second column. This
# probability is given by the *hypergeometric distribution*. Recall that the
# hypergeometric distribution gives the probability of observing $k$
# favorable items when drawing $n$ items (without replacement) from a
# population of $N$ items consisting of exactly two different kinds of items,
#
# $$
# \mathbb{P}(X=k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}
# $$
#
# where $N$ is the population size, $K$ is the total number of possible
# favorable draws, $n$ is the number of draws, and $k$ is the number of
# observed favorable draws. With the corresponding identification of
# variables, the hypergeometric distribution gives the desired conditional
# probability: $K=r_1, k=x, n=c_1, N=r_1+r_2$.
# In the example of Table [2](#tab:contingencyTable), we want the probability
# of $x=13$ male infections among a population of $r_1=24$ males, given a
# total of $c_1=25$ infected persons in a population that also includes
# $r_2=13$ females. The `scipy.stats` module has the Fisher Exact Test
# implemented as shown below,

# In[18]:


import scipy.stats
table = [[13,11],[12,1]]
odds_ratio, p_value=scipy.stats.fisher_exact(table)
print(p_value)


# The default for `scipy.stats.fisher_exact` is the two-sided test. The
# following result is for the `less` option,

# In[19]:


import scipy.stats
odds_ratio, p_value=scipy.stats.fisher_exact(table,alternative='less')
print(p_value)


# This means that the p-value is computed by summing over the probabilities
# of contingency tables that are *less* extreme than the given table. To
# understand what this means, we can use the `scipy.stats.hypergeom` function
# to compute the probabilities of tables in which the number of infected
# males is less than or equal to 13.

# In[20]:


hg = scipy.stats.hypergeom(37, 24, 25)
probs = [(hg.pmf(i)) for i in range(14)]
print (probs)
print(sum(probs))


# This is the same as the prior p-value result we obtained from
# `scipy.stats.fisher_exact`. Another option is `greater`, which derives from
# the following analogous summation,

# In[21]:


odds_ratio, p_value=scipy.stats.fisher_exact(table,alternative='greater')
probs = [hg.pmf(i) for i in range(13,25)]
print(probs)
print(p_value)
print(sum(probs))


# Finally, the two-sided version sums the probabilities of all tables whose
# individual probabilities are less than or equal to that of the given table,

# In[22]:


_,p_value=scipy.stats.fisher_exact(table)
probs = [ hg.pmf(i) for i in range(25) ]
print(sum(i for i in probs if i<= hg.pmf(13)))
print(p_value)


# Thus, for this particular contingency table, we could reasonably conclude
# that 13 infected males in this total population is statistically
# significant, with a p-value less than five percent.
#
# Performing this kind of analysis for tables larger than `2x2` easily
# becomes computationally challenging due to the nature of the underlying
# combinatorics and usually requires specialized approximations.
#
# In this section, we discussed the structure of statistical hypothesis
# testing and defined the various terms that are commonly used for this
# process, along with illustrations of what they mean in our running
# coin-flipping example. From an engineering standpoint, hypothesis testing
# is not as common as confidence intervals and point estimates. On the other
# hand, hypothesis testing is very common in social and medical science,
# where one must deal with practical constraints that may limit the sample
# size or other aspects of the hypothesis testing rubric. In engineering, we
# can usually have much more control over the samples and models we employ
# because they are typically inanimate objects that can be measured
# repeatedly and consistently. This is obviously not so with human studies,
# which generally have other ethical and legal considerations.