#!/usr/bin/env python
# coding: utf-8

# In[1]:


from IPython.display import Image
Image('../../Python_probability_statistics_machine_learning_2E.png',width=200)


# This chapter takes a geometric view of probability theory and relates it to
# familiar concepts in linear algebra and geometry. This approach connects your
# natural geometric intuition to the key abstractions in probability that can
# help guide your reasoning. This is particularly important in probability
# because it is easy to be misled. We need a bit of rigor and some intuition to
# guide us.
#
# In grade school, you were introduced to the natural numbers (i.e., `1,2,3,...`)
# and you learned how to manipulate them by operations like addition,
# subtraction, and multiplication. Later, you were introduced to positive and
# negative numbers and were again taught how to manipulate them. Ultimately, you
# were introduced to the calculus of the real line, and learned how to
# differentiate, take limits, and so on. This progression provided more
# abstractions, but also widened the field of problems you could successfully
# tackle. The same is true of probability. One way to think about probability is
# as a new number concept that allows you to tackle problems that have a special
# kind of *uncertainty* built into them. Thus, the key idea is that there is
# some number, say $x$, with a traveling companion, say, $f(x)$, and this
# companion represents the uncertainties about the value of $x$, as if looking
# at the number $x$ through a frosted window. The degree of opacity of the
# window is represented by $f(x)$. If we want to manipulate $x$, then we have to
# figure out what to do with $f(x)$. For example, if we want $y = 2 x$, then we
# have to understand how $f(x)$ generates $f(y)$.
#
# Where is the *random* part? To conceptualize this, we need still another
# analogy: think about a beehive with the swarm around it representing $f(x)$,
# and the hive itself, which you can barely see through the swarm, as $x$. The
# random piece is that you don't know *which* bee in particular is going to
# sting you! Once this happens, the uncertainty evaporates. Up until that
# happens, all we have is a concept of a swarm (i.e., density of bees) which
# represents a *potentiality* of which bee will ultimately sting. In summary,
# one way to think about probability is as a way of carrying through
# mathematical reasoning (e.g., adding, subtracting, taking limits) with a
# notion of potentiality that is so transformed by these operations.
#
# ## Understanding Probability Density
#
# In order to understand the heart of modern probability, which is built on the
# Lebesgue theory of integration, we need to extend the concept of integration
# from basic calculus. To begin, let us consider the following piecewise
# function
#
# $$
# f(x) = \begin{cases}
# 1 & \mbox{if } 0 < x \leq 1 \\\
# 2 & \mbox{if } 1 < x \leq 2 \\\
# 0 & \mbox{otherwise }
# \end{cases}
# $$
#
# as shown in [Figure](#fig:intro_001). In calculus, you learned Riemann
# integration, which you can apply here as
#
# **Figure** (`fig:intro_001`): Simple piecewise-constant function.
#
# $$
# \int_0^2 f(x) dx = 1 + 2 = 3
# $$
#
# which has the usual interpretation as the area of the two rectangles that
# make up $f(x)$. So far, so good.
#
# With Lebesgue integration, the idea is very similar except that we focus on
# the y-axis instead of moving along the x-axis. The question is: given
# $f(x) = 1$, what is the set of $x$ values for which this is true? For our
# example, this is true whenever $x\in (0,1]$. So now we have a correspondence
# between the values of the function (namely, `1` and `2`) and the sets of $x$
# values for which this is true, namely, $\lbrace (0,1] \rbrace$ and
# $\lbrace (1,2] \rbrace$, respectively. To compute the integral, we simply take
# the function values (i.e., `1,2`) together with a way of measuring the size of
# the corresponding interval (i.e., $\mu$) as in the following:
#
# $$
# \int_0^2 f d\mu = 1 \mu(\lbrace (0,1] \rbrace) + 2 \mu(\lbrace (1,2] \rbrace)
# $$
#
# We have suppressed some of the notation above to emphasize generality. Note
# that we obtain the same value of the integral as in the Riemann case when
# $\mu((0,1]) = \mu((1,2]) = 1$. By introducing the $\mu$ function as a way of
# measuring the intervals above, we have introduced another degree of freedom in
# our integration. This accommodates many weird functions that are not tractable
# using the usual Riemann theory, but we refer you to a proper introduction to
# Lebesgue integration for further study
# [[jones2001lebesgue]](#jones2001lebesgue). Nonetheless, the key step in the
# above discussion is the introduction of the $\mu$ function, which we will
# encounter again as the so-called probability density function.
#
# ## Random Variables
#
# Most introductions to probability jump straight into *random variables* and
# then explain how to compute complicated integrals. The problem with this
# approach is that it skips over some of the important subtleties that we will
# now consider. Unfortunately, the term *random variable* is not very
# descriptive. A better term is *measurable function*. To understand why this is
# a better term, we have to dive into the formal constructions of probability by
# way of a simple example.
#
# Consider tossing a fair six-sided die. There are only six outcomes possible,
#
# $$
# \Omega=\lbrace 1,2,3,4,5,6 \rbrace
# $$
#
# As we know, if the die is fair, then the probability of each outcome is $1/6$.
# To say this formally, the measure of each set (i.e.,
# $\lbrace 1 \rbrace,\lbrace 2 \rbrace,\ldots,\lbrace 6 \rbrace$) is
# $\mu(\lbrace 1 \rbrace ) =\mu(\lbrace 2 \rbrace ) = \ldots =
# \mu(\lbrace 6 \rbrace ) = 1/6$. In this case, the $\mu$ function we discussed
# earlier is the usual *probability* mass function, denoted by $\mathbb{P}$. The
# measurable function maps a set into a number on the real line. For example,
# $ \lbrace 1 \rbrace \mapsto 1 $ is one such function.
#
# Now, here's where things get interesting. Suppose you were asked to construct
# a fair coin from the fair die. In other words, we want to throw the die and
# then record the outcomes as if we had just tossed a fair coin. How could we do
# this? One way would be to define a measurable function that says if the die
# comes up `3` or less, then we declare *heads* and otherwise declare *tails*.
# This has some strong intuition behind it, but let's articulate it in terms of
# formal theory. This strategy creates two different non-overlapping sets
# $\lbrace 1,2,3 \rbrace$ and $\lbrace 4,5,6 \rbrace$.
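# Here is a minimal simulation sketch of this strategy (assuming Numpy is
# available; the variable names are ours): simulate many throws of a fair die,
# apply the measurable function that maps $\lbrace 1,2,3 \rbrace$ to heads, and
# check that heads appears roughly half the time.

# In[ ]:


import numpy as np
rng = np.random.default_rng(0)           # seeded for reproducibility
throws = rng.integers(1, 7, size=10_000) # fair die outcomes in {1,...,6}
heads = throws <= 3                      # measurable function: {1,2,3} -> heads
print(heads.mean())                      # approximately 1/2
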
# Each set has the same probability *measure*,
#
# $$
# \begin{eqnarray*}
# \mathbb{P}(\lbrace 1,2,3 \rbrace) & = & 1/2 \\\
# \mathbb{P}(\lbrace 4,5,6 \rbrace) & = & 1/2
# \end{eqnarray*}
# $$
#
# And the problem is solved. Every time the die comes up
# $\lbrace 1,2,3 \rbrace$, we record heads and record tails otherwise.
#
# Is this the only way to construct a fair coin experiment from a fair die?
# Alternatively, we can define the sets as $\lbrace 1 \rbrace$,
# $\lbrace 2 \rbrace$, $\lbrace 3,4,5,6 \rbrace$. If we define the corresponding
# measure for each set as the following
#
# $$
# \begin{eqnarray*}
# \mathbb{P}(\lbrace 1 \rbrace) & = & 1/2 \\\
# \mathbb{P}(\lbrace 2 \rbrace) & = & 1/2 \\\
# \mathbb{P}(\lbrace 3,4,5,6 \rbrace) & = & 0
# \end{eqnarray*}
# $$
#
# then we have another solution to the fair coin problem. To implement this, all
# we do is ignore every time the die shows `3,4,5,6` and throw again. This is
# wasteful, but it solves the problem. Nonetheless, we hope you can see how the
# interlocking pieces of the theory provide a framework for carrying the notion
# of uncertainty/potentiality from one problem to the next (e.g., from the fair
# die to the fair coin).
#
# Let's consider a slightly more interesting problem where we toss two dice. We
# assume that each throw is *independent*, meaning that the outcome of one does
# not influence the other. What are the sets in this case? They are all pairs of
# possible outcomes from two throws as shown below,
#
# $$
# \Omega = \lbrace (1,1),(1,2),\ldots,(5,6),(6,6) \rbrace
# $$
#
# What are the measures of each of these sets? By virtue of the independence
# claim, the measure of each is the product of the respective measures of each
# element. For instance,
#
# $$
# \mathbb{P}((1,2)) = \mathbb{P}(\lbrace 1 \rbrace) \mathbb{P}(\lbrace 2
# \rbrace) = \frac{1}{6^2}
# $$
#
# With all that established, we can ask the following question: what is the
# probability that the sum of the dice equals seven? As before, the first thing
# to do is characterize the measurable function for this as
# $X:(a,b) \mapsto (a+b)$. Next, we associate all of the $(a,b)$ pairs with
# their sum. We can create a Python dictionary for this as shown,

# In[2]:


d={(i,j):i+j for i in range(1,7) for j in range(1,7)}


# The next step is to collect all of the $(a,b)$ pairs that sum to each of the
# possible values from two to twelve.

# In[3]:


from collections import defaultdict
dinv = defaultdict(list)
for i,j in d.items():
    dinv[j].append(i)


# **Programming Tip.**
#
# The `defaultdict` object from the built-in collections module creates
# dictionaries with default values when it encounters a new key. Otherwise, we
# would have had to create default values manually for a regular dictionary.
#
# For example, `dinv[7]` contains the following list of pairs that sum to seven,

# In[4]:


dinv[7]


# The next step is to compute the probability measure for each of these items.
# Using the independence assumption, this means we have to compute the sum of
# the products of the individual item probabilities in `dinv`. Because we know
# that each outcome is equally likely, the probability of every term in the sum
# equals $1/36$. Thus, all we have to do is count the number of items in the
# corresponding list for each key in `dinv` and divide by `36`. For example,
# `dinv[11]` contains `[(5, 6), (6, 5)]`.
# The probability of `5+6=6+5=11` is the probability of this set, which is the
# sum of the probabilities of the individual elements `{(5,6),(6,5)}`. In this
# case, we have $\mathbb{P}(11) = \mathbb{P}(\lbrace (5,6) \rbrace)+
# \mathbb{P}(\lbrace (6,5) \rbrace) = 1/36 + 1/36 = 2/36$. Repeating this
# procedure for all the elements, we derive the probability mass function as
# shown below,

# In[5]:


X={i:len(j)/36. for i,j in dinv.items()}
print(X)


# **Programming Tip.**
#
# In the preceding code, note that `36.` is written with the trailing decimal
# mark. This is a good habit to get into because the default division operation
# changed between Python 2.x and Python 3.x. In Python 2.x division is integer
# division by default, and it is floating-point division in Python 3.x.
#
# The above example exposes the elements of probability theory that are in play
# for this simple problem while deliberately suppressing some of the gory
# technical details. With this framework, we can ask other questions like what
# is the probability that half the product of three dice will exceed their sum?
# We can solve this using the same method as in the following. First, let's
# create the first mapping,

# In[6]:


d={(i,j,k):((i*j*k)/2>i+j+k) for i in range(1,7)
                             for j in range(1,7)
                             for k in range(1,7)}


# The keys of this dictionary are the triples and the values are the logical
# values of whether or not half the product of three dice exceeds their sum.
# Now, we do the inverse mapping to collect the corresponding lists,

# In[7]:


dinv = defaultdict(list)
for i,j in d.items():
    dinv[j].append(i)


# Note that `dinv` contains only two keys, `True` and `False`. Again, because
# the dice are independent, the probability of any triple is $1/6^3$. Finally,
# we collect this for each outcome as in the following,

# In[8]:


X={i:len(j)/6.0**3 for i,j in dinv.items()}
print(X)


# Thus, the probability of half the product of three dice exceeding their sum is
# `136/(6.0**3) = 0.63`. The set that is induced by the random variable has only
# two elements in it, `True` and `False`, with
# $\mathbb{P}(\mbox{True})=136/216$ and $\mathbb{P}(\mbox{False})=1-136/216$.
#
# As a final example to exercise another layer of generality, let us consider
# the first problem with the two dice where we want the probability of a seven,
# but this time one of the dice is no longer fair. The distribution for the
# unfair die is the following:
#
# $$
# \begin{eqnarray*}
# \mathbb{P}(\lbrace 1\rbrace)=\mathbb{P}(\lbrace 2 \rbrace)=\mathbb{P}(\lbrace
# 3 \rbrace) = \frac{1}{9} \\\
# \mathbb{P}(\lbrace 4\rbrace)=\mathbb{P}(\lbrace 5 \rbrace)=\mathbb{P}(\lbrace
# 6 \rbrace) = \frac{2}{9}
# \end{eqnarray*}
# $$
#
# From our earlier work, we know the elements corresponding to the sum of seven
# are the following:
#
# $$
# \lbrace (1,6),(2,5),(3,4),(4,3),(5,2),(6,1) \rbrace
# $$
#
# Because we still have the independence assumption, all we need to change is
# the probability computation of each of the elements. For example, given that
# the first die is the unfair one, we have
#
# $$
# \mathbb{P}((1,6)) = \mathbb{P}(1)\mathbb{P}(6) = \frac{1}{9} \times
# \frac{1}{6}
# $$
#
# and likewise for $(2,5)$ we have the following:
#
# $$
# \mathbb{P}((2,5)) = \mathbb{P}(2)\mathbb{P}(5) = \frac{1}{9} \times
# \frac{1}{6}
# $$
#
# and so forth.
# Summing all of these gives the following:
#
# $$
# \mathbb{P}_X(7) = \frac{1}{9} \times \frac{1}{6}
# +\frac{1}{9} \times \frac{1}{6}
# +\frac{1}{9} \times \frac{1}{6}
# +\frac{2}{9} \times \frac{1}{6}
# +\frac{2}{9} \times \frac{1}{6}
# +\frac{2}{9} \times \frac{1}{6} = \frac{1}{6}
# $$
#
# Let's try computing this using Pandas instead of Python dictionaries. First,
# we construct a `DataFrame` object with an index of tuples consisting of all
# pairs of possible dice outcomes.

# In[9]:


from pandas import DataFrame
d=DataFrame(index=[(i,j) for i in range(1,7) for j in range(1,7)],
            columns=['sm','d1','d2','pd1','pd2','p'])


# Now, we can populate the columns that we set up above where the outcome of the
# first die is the `d1` column and the outcome of the second die is `d2`,

# In[10]:


d.d1=[i[0] for i in d.index]
d.d2=[i[1] for i in d.index]


# Next, we compute the sum of the dice in the `sm` column,

# In[11]:


d.sm=list(map(sum,d.index))


# With that established, the DataFrame now looks like the following:

# In[12]:


d.head(5) # show first five lines


# Next, we fill out the probabilities for each face of the unfair die (`d1`) and
# the fair die (`d2`),

# In[13]:


d.loc[d.d1<=3,'pd1']=1/9.
d.loc[d.d1 > 3,'pd1']=2/9.
d.pd2=1/6.
d.head(10)


# Finally, we can compute the joint probabilities for the sum of the shown faces
# as the following:

# In[14]:


d.p = d.pd1 * d.pd2
d.head(5)


# With all that established, we can compute the density of all the dice outcomes
# by using `groupby` as in the following,

# In[15]:


d.groupby('sm')['p'].sum()


# These examples have shown how the theory of probability breaks down sets and
# measurements of those sets and how these can be combined to develop the
# probability mass functions for new random variables.
#
# ## Continuous Random Variables
#
# The same ideas work with continuous variables but managing the sets becomes
# trickier because the real line, unlike discrete sets, has many limiting
# properties already built into it that have to be handled carefully.
# Nonetheless, let's start with an example that should illustrate the analogous
# ideas. Suppose a random variable $X$ is uniformly distributed on the unit
# interval. What is the probability that the variable takes on values less than
# 1/2?
#
# To build on our intuition from the discrete case, let's go back to our
# dice-throwing experiment with the fair dice. The sum of the values of the dice
# is a measurable function,
#
# $$
# Y \colon \lbrace 1,2,\dots,6 \rbrace^2 \mapsto \lbrace 2,3,\ldots, 12 \rbrace
# $$
#
# That is, $Y$ is a mapping of the Cartesian product of sets to a discrete set
# of outcomes. In order to compute probabilities of the set of outcomes, we need
# to derive the probability measure for $Y$, $\mathbb{P}_Y$, from the
# corresponding probability measures for each die. Our previous discussion went
# through the mechanics of that. This means that
#
# $$
# \mathbb{P}_Y \colon \lbrace 2,3,\ldots,12 \rbrace \mapsto [0,1]
# $$
#
# Note there is a separation between the function definition and where the
# target items of the function are measured in probability. More bluntly,
#
# $$
# Y \colon A \mapsto B
# $$
#
# with,
#
# $$
# \mathbb{P}_Y \colon B \mapsto [0,1]
# $$
#
# Thus, to compute $\mathbb{P}_Y$, which is derived from other random variables,
# we have to express the equivalence classes in $B$ in terms of their progenitor
# $A$ sets.
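# For the discrete dice example, this bookkeeping can be written out directly.
# The following sketch (ours, in plain Python; `P_A` and `preimage` are
# hypothetical names) measures the $B$-type set $\lbrace 7 \rbrace$ by
# collecting its progenitor $A$-type set, namely the preimage under $Y$, and
# summing the measure over that preimage.

# In[ ]:


# probability measure on the A-type sets (pairs of fair-die outcomes)
P_A = {(i,j): 1/36 for i in range(1,7) for j in range(1,7)}
# preimage of the B-type set {7} under the measurable function Y(a,b)=a+b
preimage = [ab for ab in P_A if sum(ab) == 7]
# P_Y({7}) is the measure of that preimage in A
print(sum(P_A[ab] for ab in preimage))   # 6/36 = 1/6
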
#
# The situation for continuous variables follows the same pattern, but with many
# more deep technicalities that we are going to skip. For the continuous case,
# the random variable is now,
#
# $$
# X \colon \mathbb{R} \mapsto \mathbb{R}
# $$
#
# with corresponding probability measure,
#
# $$
# \mathbb{P}_X \colon \mathbb{R} \mapsto [0,1]
# $$
#
# But where are the corresponding sets here? Technically, these are the *Borel*
# sets, but we can just think of them as intervals. Returning to our question,
# what is the probability that a uniformly distributed random variable on the
# unit interval takes values less than $1/2$? Rephrasing this question according
# to the framework, we have the following:
#
# $$
# X \colon [0,1] \mapsto [0,1]
# $$
#
# with corresponding,
#
# $$
# \mathbb{P}_X \colon [0,1] \mapsto [0,1]
# $$
#
# To answer the question, by the definition of the uniform random variable on
# the unit interval, we compute the following integral,
#
# $$
# \mathbb{P}_X([0,1/2]) = \mathbb{P}_X(0 < X < 1/2) = \int_0^{1/2} dx = 1/2
# $$
#
# where the above integral's $dx$ sweeps through intervals of the $B$-type. The
# measure of any $dx$ interval (i.e., $A$-type set) is equal to $dx$, by
# definition of the uniform random variable. To get all the moving parts into
# one notationally rich integral, we can also write this as,
#
# $$
# \mathbb{P}_X(0 < X < 1/2) = \int_0^{1/2} d\mathbb{P}_X(dx) = 1/2
# $$
#
# Now, let's consider a slightly more complicated and interesting example. As
# before, suppose we have a uniform random variable, $X$, and let us introduce
# another random variable defined,
#
# $$
# Y = 2 X
# $$
#
# Now, what is the probability that $0 < Y < \frac{1}{2}$? To express this in
# our framework, we write,
#
# $$
# Y \colon [0,1] \mapsto [0,2]
# $$
#
# with corresponding,
#
# $$
# \mathbb{P}_Y \colon [0,2] \mapsto [0,1]
# $$
#
# To answer the question, we need to measure the set $[0,1/2]$ with the
# probability measure for $Y$, $\mathbb{P}_Y([0,1/2])$. How can we do this?
# Because $Y$ is derived from the $X$ random variable, as with the fair-dice
# throwing experiment, we have to create a set of equivalences in the target
# space (i.e., $B$-type sets) that reflect back on the input space (i.e.,
# $A$-type sets). That is, what is the interval $[0,1/2]$ equivalent to in terms
# of the $X$ random variable? Because, functionally, $Y=2 X$, the $B$-type
# interval $[0,1/2]$ corresponds to the $A$-type interval $[0,1/4]$. From the
# probability measure of $X$, we compute this with the integral,
#
# $$
# \mathbb{P}_Y([0,1/2]) =\mathbb{P}_X([0,1/4])= \int_0^{1/4} dx = 1/4
# $$
#
# Now, let's up the ante and consider the following random variable,
#
# $$
# Y = X^2
# $$
#
# where now $X$ is still uniformly distributed, but now over the interval
# $[-1/2,1/2]$. We can express this in our framework as,
#
# $$
# Y \colon [-1/2,1/2] \mapsto [0,1/4]
# $$
#
# with corresponding,
#
# $$
# \mathbb{P}_Y \colon [0,1/4] \mapsto [0,1]
# $$
#
# What is the $\mathbb{P}_Y(Y < 1/8)$? In other words, what is the measure of
# the set $B_Y= [0,1/8]$? As before, because $Y$ is derived from our uniformly
# distributed random variable $X$, we have to reflect the $B_Y$ set onto sets of
# the $A$-type. The thing to recognize is that because $X^2$ is symmetric about
# zero, all $B_Y$ sets reflect back into two sets. This means that for any set
# $B_Y$, we have the correspondence $B_Y = A_X^+ \cup A_X^{-}$. So, we have,
#
# $$
# B_Y=\Big\lbrace 0 < x^2 < \frac{1}{8} \Big\rbrace = A_X^+ \cup A_X^{-}
# $$
#
# where
#
# $$
# A_X^+ = \Big\lbrace 0 < x < \frac{1}{2\sqrt{2}} \Big\rbrace
# \quad \mbox{and} \quad
# A_X^{-} = \Big\lbrace -\frac{1}{2\sqrt{2}} < x < 0 \Big\rbrace
# $$
#
# Because $X$ is uniformly distributed over an interval of unit length, the
# measure of each of these sets is just its length, $\frac{1}{2\sqrt{2}}$, so
# that
#
# $$
# \mathbb{P}_Y(Y < 1/8) = \mathbb{P}_X(A_X^+) + \mathbb{P}_X(A_X^{-})
# = \frac{1}{2\sqrt{2}} + \frac{1}{2\sqrt{2}} = \frac{1}{\sqrt{2}}
# $$
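# A quick numerical check of this reflection argument (a sketch of ours,
# assuming Numpy is available) draws $X$ uniformly from $[-1/2,1/2]$ and
# estimates $\mathbb{P}(X^2 < 1/8)$, which should be close to
# $1/\sqrt{2}\approx 0.707$,

# In[ ]:


import numpy as np
x = np.random.rand(100_000) - 1/2   # uniform on [-1/2, 1/2]
print(np.mean(x**2 < 1/8))          # approx 1/sqrt(2) ~ 0.707
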
In this case, $Y>X$ because $Z$ cannot be positive # otherwise. For the density function, we are interested in the set # $\lbrace 0 # < Z < z \rbrace $. We want to compute # # $$ # \mathbb{P}(Z X$. For $Z < z $, # we have $Y > X(1/z+1)$. # Putting this together gives # # $$ # A_1 = \lbrace \max (X,X(1/z+1)) < Y < 1 \rbrace # $$ # # Integrating this over $Y$ as follows, # # $$ # \int_0^1\lbrace\max(X,X(1/z+1)) \frac{X}{1-X} # $$ # # and integrating this one more time over $X$ gives # # $$ # \int_0^{\frac{z}{1+z}} \frac{-X+z-Xz}{z} dX = \frac{z}{2(z+1)} \mbox{ where } # z > 0 # $$ # # Note that this is the computation for the *probability* # itself, not the # probability density function. To get that, all we have # to do is differentiate # the last expression to obtain # # $$ # f_Z(z) = \frac{1}{(z+1)^2} \mbox{ where } z > 0 # $$ # # Now we need to compute this density using the same process # for when $z < -1$. # We want the interval $ Z < z $ for when $z < -1$. # For a fixed $z$, this is # equivalent to $ X(1+1/z) < Y$. Because $z$ # is negative, this also means that $Y # < X$. Under these terms, we # have the following integral, # # $$ # \int_0^1 \lbrace X(1/z+1) 0 \\\ # \frac{1}{2 z^2} & \mbox{if } z < -1 \\\ # 0 & \mbox{otherwise } # \end{cases} # $$ # # We will leave it as an exercise to show that this # integrates out to one. # # # ## Independent Random Variables # # Independence is a standard assumption. # Mathematically, the # necessary and sufficient condition for independence between # two # random variables $X$ and $Y$ is the following: # # $$ # \mathbb{P}(X,Y) = \mathbb{P}(X)\mathbb{P}(Y) # $$ # # Two random variables $X$ and $Y$ # are *uncorrelated* if, # # $$ # \mathbb{E}\left( (X-\overline{X})(Y-\overline{Y}) \right)=0 # $$ # # where $\overline{X}=\mathbb{E}(X)$ Note that uncorrelated random # variables are # sometimes called *orthogonal* random variables. Uncorrelatedness # is a weaker # property than independence, however. For example, consider the # discrete random # variables $X$ and $Y$ uniformly distributed over the set # $\lbrace 1,2,3 \rbrace$ # where # # $$ # X = # \begin{cases} # 1 & \mbox{if } \omega =1 \\\ # 0 & \mbox{if } \omega =2 # \\\ # -1 & \mbox{if } \omega =3 # \end{cases} # $$ # # and also, # # $$ # Y = # \begin{cases} # 0 & \mbox{if } \omega =1 \\\ # 1 & \mbox{if } \omega =2 # \\\ # 0 & \mbox{if } \omega =3 # \end{cases} # $$ # # Thus, $\mathbb{E}(X)=0$ and $\mathbb{E}(X Y)=0$, so # $X$ and $Y$ are # uncorrelated. However, we have # # $$ # \mathbb{P}(X=1,Y=1)=0\neq \mathbb{P}(X=1)\mathbb{P}(Y=1)=\frac{1}{9} # $$ # # So, these two random variables are *not* independent. # Thus, uncorrelatedness # does not imply independence, generally, but # there is the important case of # Gaussian random variables for which # it does. To see this, consider the # probability density function # for two zero-mean, unit-variance Gaussian random # variables $X$ and # $Y$, # # $$ # f_{X,Y}(x,y) = \frac{e^{\frac{x^2-2 \rho x # y+y^2}{2 # \left(\rho^2-1\right)}}}{2 \pi # \sqrt{1-\rho^2}} # $$ # # where $\rho:=\mathbb{E}(X Y)$ is the correlation coefficient. In # the # uncorrelated case where $\rho=0$, the probability density function factors # into # the following, # # $$ # f_{X,Y}(x,y)=\frac{e^{-\frac{1}{2}\left(x^2+y^2\right)}}{2\pi}=\frac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi}}\frac{e^{-\frac{y^2}{2}}}{\sqrt{2\pi}} # =f_X(x)f_Y(y) # $$ # # which means that $X$ and $Y$ are independent. 
#
# Independence and conditional independence are closely related, as in the
# following:
#
# $$
# \mathbb{P}(X,Y\vert Z) =\mathbb{P}(X\vert Z) \mathbb{P}(Y\vert Z)
# $$
#
# which says that $X$ and $Y$ are independent conditioned on $Z$. Conditioning
# independent random variables can break their independence. For example,
# consider two independent Bernoulli-distributed random variables,
# $X_1, X_2\in\lbrace 0,1 \rbrace$. We define $Z=X_1+X_2$. Note that
# $Z\in \lbrace 0,1,2 \rbrace$. In the case where $Z=1$, we have,
#
# $$
# \begin{align*}
# \mathbb{P}(X_1=1\vert Z=1) &>0 \\\
# \mathbb{P}(X_2=1\vert Z=1) &>0
# \end{align*}
# $$
#
# Even though $X_1,X_2$ are independent, after conditioning on $Z$, we have the
# following,
#
# $$
# \mathbb{P}(X_1=1,X_2=1\vert Z=1)=0\neq \mathbb{P}(X_1=1\vert
# Z=1)\mathbb{P}(X_2=1\vert Z=1)
# $$
#
# Thus, conditioning on $Z$ breaks the independence of $X_1,X_2$. This also
# works in the opposite direction: conditioning can make dependent random
# variables independent. Define $Z_n=\sum_{i=1}^n X_i$ with the $X_i$
# independent, integer-valued random variables. The $Z_n$ variables are
# dependent because they stack the same telescoping set of $X_i$ variables.
# Consider the following,
#
# $$
# \begin{equation}
# \mathbb{P}(Z_1=i,Z_3=j\vert Z_2=k) =
# \frac{\mathbb{P}(Z_1=i,Z_2=k,Z_3=j)}{\mathbb{P}(Z_2 =k)}
# \label{_auto1} \tag{1}
# \end{equation}
# $$
#
# $$
# \begin{equation}
# =\frac{\mathbb{P}(X_1 =i)\mathbb{P}(X_2 =k-i)\mathbb{P}(X_3 =j-k)
# }{\mathbb{P}(Z_2 =k)}
# \label{eq:condIndep} \tag{2}
# \end{equation}
# $$
#
# where the factorization comes from the independence of the $X_i$ variables.
# Using the definition of conditional probability,
#
# $$
# \mathbb{P}(Z_1=i\vert Z_2=k)=\frac{\mathbb{P}(Z_1=i,Z_2=k)}{\mathbb{P}(Z_2=k)}
# $$
#
# We can continue to expand Equation [2](#eq:condIndep),
#
# $$
# \begin{align*}
# \mathbb{P}(Z_1=i,Z_3=j\vert Z_2=k) &=\mathbb{P}(Z_1 =i\vert Z_2=k)
# \frac{\mathbb{P}( X_3 =j-k)\mathbb{P}( Z_2 =k)}{\mathbb{P}( Z_2 =k)}\\\
# &=\mathbb{P}(Z_1 =i\vert Z_2=k)\mathbb{P}(Z_3 =j\vert Z_2=k)
# \end{align*}
# $$
#
# where $\mathbb{P}(X_3=j-k)\mathbb{P}(Z_2=k)= \mathbb{P}(Z_3=j,Z_2=k)$. Thus,
# we see that dependence between random variables can be broken by conditioning
# to create conditionally independent random variables. As we have just
# witnessed, understanding how conditioning influences independence is important
# and is the main topic of study in Probabilistic Graphical Models, a field with
# many algorithms and concepts to extract these notions of conditional
# independence from graph-based representations of random variables.
#
# ## Classic Broken Rod Example
#
# Let's do one last example to exercise fluency in our methods by considering
# the following classic problem: given a rod of unit length, broken
# independently and randomly at two places, what is the probability that you can
# assemble the three remaining pieces into a triangle? The first task is to find
# a representation of a triangle as an easy-to-apply constraint. What we want is
# something like the following:
#
# $$
# \mathbb{P}(\mbox{ triangle exists }) = \int_0^1 \int_0^1 \lbrace \mbox{
# triangle exists } \rbrace dX dY
# $$
#
# where $X$ and $Y$ are independent and uniformly distributed in the unit
# interval. Heron's formula for the area of the triangle,
#
# $$
# \mbox{ area } = \sqrt{(s-a)(s-b)(s-c)s}
# $$
#
# where $s = (a+b+c)/2$, is what we need. The idea is that this yields a valid
# area only when each of the terms under the square root is greater than or
# equal to zero. Thus, suppose that we have
#
# $$
# \begin{eqnarray*}
# a & = & X \\\
# b & = & Y-X \\\
# c & = & 1-Y
# \end{eqnarray*}
# $$
#
# assuming that $Y>X$. Thus, the criterion for a valid triangle boils down to
#
# $$
# \lbrace (s > a) \wedge (s > b) \wedge (s > c) \wedge (X < Y) \rbrace
# $$
#
# Integrating this indicator over the unit square gives $1/8$ for the case
# $Y>X$. By symmetry, we get the same result for $X>Y$. Thus, the final result
# is the following:
#
# $$
# \mathbb{P}(\mbox{ triangle exists }) = \frac{1}{8}+\frac{1}{8} = \frac{1}{4}
# $$
#
# We can quickly check this result using Python for the case $Y>X$ using the
# following code:

# In[16]:


import numpy as np
x,y = np.random.rand(2,1000) # uniform rv
a,b,c = x,(y-x),1-y # 3 sides
s = (a+b+c)/2
np.mean((s>a) & (s>b) & (s>c) & (y>x)) # approx 1/8=0.125


# **Programming Tip.**
#
# The chained logical `&` symbols above tell Numpy that the logical operation
# should be considered element-wise.
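# To cover both orderings at once, a slightly more general sketch (ours, not
# from the text above) sorts the two break points so the three pieces are always
# well-defined; the estimate then approaches $1/4$,

# In[ ]:


import numpy as np
x, y = np.random.rand(2, 100_000)
left, right = np.minimum(x, y), np.maximum(x, y)  # sorted break points
a, b, c = left, right - left, 1 - right           # the three pieces
s = (a + b + c) / 2
print(np.mean((s > a) & (s > b) & (s > c)))       # approx 1/4 = 0.25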