#!/usr/bin/env python
# coding: utf-8
# In[1]:
from IPython.display import Image
Image('../../Python_probability_statistics_machine_learning_2E.png',width=200)
# # Useful Inequalities
#
# In practice, few quantities can be analytically calculated. Some knowledge
# of bounding inequalities helps find the ballpark for potential solutions.
# This section discusses several key inequalities that are important for
# probability, statistics, and machine learning.
#
# ## Markov's Inequality
#
# Let $X$ be a non-negative random variable
# and suppose that
# $\mathbb{E}(X) < \infty$. Then,
# for any $t>0$,
#
# $$
# \mathbb{P}(X>t)\leq \frac{\mathbb{E}(X)}{t}
# $$
#
# This is a foundational inequality that is used as a stepping stone to other
# inequalities. It is easy to prove. Because $X\ge 0$, we have the following,
#
# $$
# \begin{align*}
# \mathbb{E}(X)&=\int_0^\infty x f_x(x)dx =\underbrace{\int_0^t
# x f_x(x)dx}_{\text{omit this}}+\int_t^\infty x f_x(x)dx \\
# &\ge\int_t^\infty x f_x(x)dx \ge t\int_t^\infty f_x(x)dx = t \mathbb{P}(X>t)
# \end{align*}
# $$
#
# The step that establishes the inequality is where the term $\int_0^t x
# f_x(x)dx$ is omitted. For a particular $f_x(x)$ that is concentrated on the
# $[0,t]$ interval, this can be a lot of probability mass to throw out. For that
# reason, the Markov Inequality is considered a *loose* inequality, meaning that
# there can be a substantial gap between the two sides of the inequality. For
# example, as shown in [Figure](#fig:ProbabilityInequalities_001), the $\chi^2$
# distribution has a lot of its mass on the left, which would be omitted in the
# Markov Inequality. [Figure](#fig:ProbabilityInequalities_002) shows the two
# curves established by the Markov Inequality. The gray shaded region is the gap
# between the two terms and indicates the looseness of the bound (a fatter
# shaded region means a looser bound) for this case.
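#
# As a quick numerical check (a sketch that assumes Scipy is available and is
# used here only for numerical convenience), we can compare the exact tail
# probability of the $\chi_1^2$ distribution, whose mean is one, against the
# Markov bound $\mathbb{E}(X)/t = 1/t$,
# In[ ]:
from scipy import stats
chi2_rv = stats.chi2(1)        # chi-squared with one degree of freedom, E(X)=1
for tv in [1,2,5,10]:
    exact = chi2_rv.sf(tv)     # P(X > t), the survival function
    bound = 1.0/tv             # Markov bound E(X)/t
    print('t=%d: P(X>t)=%.4f <= %.4f'%(tv,exact,bound))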
#
#
#
#
# The $\chi_1^2$ density has much
# of its weight on the left, which is excluded in the establishment of the Markov
# Inequality.
#
#
#
#
#
#
#
#
#
# The shaded area
# shows the region between the curves on either side of the Markov Inequality.
#
#
#
#
#
# ## Chebyshev's Inequality
#
# Chebyshev's Inequality drops out
# directly from the Markov Inequality. Let
# $\mu=\mathbb{E}(X)$ and
# $\sigma^2=\mathbb{V}(X)$. Then, we have
#
# $$
# \mathbb{P}(\vert X-\mu\vert \ge t) \le \frac{\sigma^2}{t^2}
# $$
#
# Note that if we normalize so that $Z=(X-\mu)/\sigma$, we
# have $\mathbb{P}(\vert
# Z\vert \ge k) \le 1/k^2$. In particular,
# $\mathbb{P}(\vert Z\vert \ge 2) \le
# 1/4$. We can illustrate this inequality using the Sympy statistics module,
# In[2]:
import sympy
import sympy.stats as ss
t=sympy.symbols('t',real=True) # threshold in the inequality
x=ss.ChiSquared('x',1)         # chi-squared random variable, one degree of freedom
# To get the left side of the Chebyshev inequality (for this $\chi_1^2$ random
# variable, $\mu=1$ and $\sigma^2=2$), we can write out the absolute value as
# the sum of the two one-sided tail probabilities,
# In[3]:
r = ss.P(x-1 > t)+ss.P(1-x > t)
# We could take the above expression, which is a function of $t$, and attempt
# to compute the integral, but that would take a very long time (the expression
# is very long and complicated, which is why we did not print it out above). In
# this situation, it's better to use the built-in cumulative distribution
# function as in the following (after some rearrangement of the terms),
# In[4]:
w=(1-ss.cdf(x)(t+1))+ss.cdf(x)(1-t)
# To plot this, we can evaluate it at a variety of `t` values by using the
# `.subs` substitution method, but it is more convenient to use the `lambdify`
# function to convert the expression into a function.
# In[5]:
fw=sympy.lambdify(t,w)
# Then, we can evaluate this function using something like
# In[6]:
[fw(i) for i in [0,1,2,3,4,5]]
# to produce the following [Figure](#fig:ProbabilityInequalities_003).
#
#
#
#
#
# The
# shaded area shows the region between the curves on either side of the Chebyshev
# Inequality.
#
#
#
#
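#
# As a sketch of how loose the bound is, we can compare the exact two-sided
# tail probability $\mathbb{P}(\vert X-1\vert > t)$ (the same quantity computed
# by `fw` above) against the Chebyshev bound $\sigma^2/t^2 = 2/t^2$, since the
# $\chi_1^2$ distribution has variance two. This check uses Scipy purely for
# numerical convenience.
# In[ ]:
from scipy import stats
chi2_rv = stats.chi2(1)
for tv in [1,2,3,4,5]:
    exact = chi2_rv.sf(1+tv)+chi2_rv.cdf(1-tv)  # P(X>1+t) + P(X<1-t)
    bound = 2.0/tv**2                           # Chebyshev bound sigma^2/t^2
    print('t=%d: exact=%.4f <= bound=%.4f'%(tv,exact,bound))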
#
# **Programming Tip.**
#
# Note that we cannot use vectorized inputs for the function generated by
# `lambdify` here because the expression contains embedded functions that are
# only available in Sympy. Otherwise, we could have used `lambdify(t,w,'numpy')`
# to specify the corresponding Numpy functions to use for the expression.
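#
# For contrast, the following sketch applies `lambdify` with the `'numpy'`
# module argument to a simpler expression of our own choosing, $e^{-t^2}$,
# whose functions all have Numpy counterparts, so the resulting function does
# accept vectorized (array) inputs.
# In[ ]:
import numpy as np
g=sympy.lambdify(t,sympy.exp(-t**2),'numpy')
g(np.array([0.0,1.0,2.0]))   # evaluates elementwise over the array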
#
#
#
# ## Hoeffding's Inequality
#
#
# Hoeffding's Inequality is similar to, but less loose than, Markov's
# Inequality.
# Let $X_1,\ldots,X_n$ be iid observations such that
# $\mathbb{E}(X_i)=\mu$ and
# $a\le X_i \le b$. Then, for any $\epsilon>0$, we have
#
# $$
# \mathbb{P}(\vert \overline{X}_n -\mu\vert \ge \epsilon) \le 2 \exp(-2
# n\epsilon^2/(b-a)^2)
# $$
#
# where $\overline{X}_n = \tfrac{1}{n}\sum_{i=1}^n X_i$. Note that we further
# assume that the individual random variables are bounded.
#
# **Corollary.** If $X_1,\ldots,X_n$ are independent with
# $\mathbb{P}(a\le X_i\le b)=1$ and $\mathbb{E}(X_i)=\mu$ for all $i$, then, for
# any $\delta\in(0,1)$, with probability at least $1-\delta$ we have
#
# $$
# \vert\overline{X}_n-\mu\vert \le \sqrt{\frac{c}{2 n}\log \frac{2}{\delta}}
# $$
#
# where $c=(b-a)^2$. We will see this inequality again in the machine
# learning
# chapter. [Figure](#fig:ProbabilityInequalities_004) shows the Markov and
# Hoeffding bounds for the case of ten independent and identically distributed
# random variables, $X_i \sim \mathcal{U}[0,1]$. The solid line shows
# $\mathbb{P}(\vert \overline{X}_n - 1/2 \vert > \epsilon)$. Note that the
# Hoeffding Inequality is tighter than the Markov Inequality and that both of
# them
# merge when $\epsilon$ gets big enough.
#
#
#
#
# This shows the Markov and Hoeffding bounds for the case of ten independent
# and identically distributed uniform random variables.
#
#
#
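#
# The following Monte Carlo sketch checks the Hoeffding bound for this case of
# $n=10$ uniform random variables on $[0,1]$ (so $b-a=1$ and $\mu=1/2$). The
# number of trials and the fixed seed below are arbitrary choices made only for
# illustration.
# In[ ]:
import numpy as np
rng = np.random.default_rng(0)                  # fixed seed, only for reproducibility
n, trials = 10, 100000
xbar = rng.uniform(0,1,size=(trials,n)).mean(axis=1)
for eps in [0.05,0.1,0.2,0.3]:
    empirical = np.mean(np.abs(xbar-0.5)>eps)   # estimate of P(|Xbar_n - 1/2| > eps)
    hoeffding = 2*np.exp(-2*n*eps**2)           # Hoeffding bound with b-a=1
    print('eps=%.2f: empirical=%.4f <= bound=%.4f'%(eps,empirical,hoeffding))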
# ### Proof of Hoeffding's Inequality
#
# We will need the following lemma to prove
# Hoeffding's inequality.
#
# **Lemma.** Let $X$ be a random variable with
# $\mathbb{E}(X)=0$ and
# $a\le X\le b$. Then, for any $s>0$, we have the following,
#
#
#
#
# $$
# \begin{equation}
# \mathbb{E}(e^{s X}) \le e^{s^2(b-a)^2/8}
# \label{_auto1} \tag{1}
# \end{equation}
# $$
#
# Because $X$ is contained in the closed interval $[a,b]$, we can write it as a
# convex
# combination of the endpoints of the interval.
#
# $$
# X = \alpha_1 a + \alpha_2 b
# $$
#
# where $\alpha_1+\alpha_2=1$. Solving for the $\alpha_i$ terms, we have
#
# $$
# \begin{align*}
# \alpha_1 = & \frac{b-X}{b-a} \\
# \alpha_2 = & \frac{X-a}{b-a}
# \end{align*}
# $$
#
# From Jensen's inequality, for a convex function $f$, we know that
#
# $$
# f\left(\sum \alpha_i x_i\right) \le \sum \alpha_i f(x_i)
# $$
#
# Given the convexity of $e^X$, we therefore have,
#
# $$
# e^{s X} \le \alpha_1 e^{s a} + \alpha_2 e^ {s b}
# $$
#
# Taking the expectation of both sides, we obtain
#
# $$
# \mathbb{E}(e^{s X}) \le \mathbb{E}(\alpha_1) e^{s a}
# +\mathbb{E}(\alpha_2) e^{s b}
# $$
#
# Because $\mathbb{E}(X)=0$, we have $\mathbb{E}(\alpha_1)=\frac{b}{b-a}$ and
# $\mathbb{E}(\alpha_2)=\frac{-a}{b-a}$. Thus, we have
#
# $$
# \mathbb{E}(e^{s X}) \le \frac{b}{b-a} e^{s a} -\frac{a}{b-a} e^{s b}
# $$
#
# Using $p:=\frac{-a}{b-a}$, we can rewrite this as the following,
#
# $$
# \frac{b}{b-a} e^{s a} -\frac{a}{b-a} e^{s b} = (1-p)e^{s a} + p e^{s b} =:
# e^{\phi(u)}
# $$
#
# where
#
# $$
# \phi(u)=-p u + \log(1-p+p e^{u})
# $$
#
# and $u=s(b-a)$. Note that $\phi(0)=\phi'(0)=0$ and that
# $\phi''(u) = \frac{p(1-p)e^{u}}{(1-p+p e^{u})^2}\le \frac{1}{4}$ for all $u$.
# Thus, by Taylor's theorem with remainder, $\phi(u) = \frac{u^2}{2}\phi''(\xi)
# \le \frac{u^2}{8}$ for some $\xi\in [0,u]$, and therefore
# $\mathbb{E}(e^{s X}) \le e^{\phi(u)} \le e^{s^2(b-a)^2/8}$ $\blacksquare$.
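#
# As a quick sanity check of the lemma, consider $X$ uniformly distributed on
# $[-1,1]$ (chosen here just for illustration), so that $\mathbb{E}(X)=0$,
# $b-a=2$, and $\mathbb{E}(e^{s X})=\sinh(s)/s$ in closed form. The moment
# generating function should never exceed $e^{s^2(b-a)^2/8}=e^{s^2/2}$.
# In[ ]:
import numpy as np
for s in [0.5,1.0,2.0,4.0]:
    mgf = np.sinh(s)/s                 # E(exp(s X)) for X uniform on [-1,1]
    bound = np.exp(s**2*2**2/8.0)      # exp(s^2 (b-a)^2/8)
    print('s=%.1f: %.4f <= %.4f'%(s,mgf,bound))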
#
# To prove Hoeffding's inequality,
# we start with Markov's inequality,
#
# $$
# \mathbb{P}(X\ge\epsilon)\le \frac{\mathbb{E}(X)}{\epsilon}
# $$
#
# Then, given $s>0$, we have the following,
#
# $$
# \mathbb{P}(X\ge\epsilon)=\mathbb{P}(e^{s X} \ge e^{s\epsilon}) \le
# \frac{\mathbb{E}(e^{s X})}{e^{s \epsilon}}
# $$
#
# We can write the one-sided Hoeffding inequality as the following,
#
# $$
# \begin{align*}
# \mathbb{P}(\overline{X}_n -\mu\ge\epsilon) & \le
# e^{-s\epsilon}\mathbb{E}(\exp(\frac{s}{n}\sum_{i=1}^n (X_i-\mathbb{E}(X_i)))) \\
# & = e^{-s\epsilon}\prod_{i=1}^n\mathbb{E}(e^{ \frac{s}{n} (X_i-\mathbb{E}(X_i))
# }) \\
# & \le e^{-s\epsilon}\prod_{i=1}^n e^{\frac{s^2}{n^2}(b-a)^2/8 } \\
# & = e^{-s\epsilon} e^{\frac{s^2}{n}(b-a)^2/8}
# \end{align*}
# $$
#
# where the second line uses the independence of the $X_i$ and the third line
# applies the lemma to each $X_i-\mathbb{E}(X_i)$, which has zero mean and lies
# in an interval of width $b-a$. Now, we want to pick $s>0$ to minimize this
# upper bound. Then, with $s=\frac{4 n\epsilon}{(b-a)^2}$,
#
# $$
# \mathbb{P}(\overline{X}_n-\mu\ge\epsilon)\le e^{-\frac{2
# n\epsilon^2}{(b-a)^2}}
# $$
#
# The other side of the inequality follows similarly to obtain Hoeffding's
# inequality $\blacksquare$.
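#
# The following Sympy sketch verifies the choice of $s$ by minimizing the
# exponent $-s\epsilon + s^2(b-a)^2/(8 n)$ from the bound above; the symbol
# names introduced here are ad hoc and used only for this check.
# In[ ]:
s,eps,nn,aa,bb=sympy.symbols('s epsilon n a b',positive=True)
exponent = -s*eps + s**2*(bb-aa)**2/(8*nn)
s_star = sympy.solve(sympy.diff(exponent,s),s)[0]  # equals 4*n*epsilon/(b-a)**2
sympy.simplify(exponent.subs(s,s_star))            # equals -2*n*epsilon**2/(b-a)**2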
#
# ## Jensen's Inequality
#
#
# If $f$ is a convex function and $v$ is a random variable, then
#
# $$
# \mathbb{E}(f(v))\ge f(\mathbb{E}(v))
# $$
#
# The proof of this is straightforward. Define the line $L(v) = a v + b$ with
# $a,b\in \mathbb{R}$. Choose $a$ and $b$ so that
# $L(\mathbb{E}(v))=f(\mathbb{E}(v))$, which makes $L$ tangent to $f$ at
# $\mathbb{E}(v)$. By the convexity of $f$, we have $f(v)\ge L(v)$. We can take
# the expectation of both sides of this,
#
# $$
# \begin{align*}
# \mathbb{E}(f(v)) \ge & \mathbb{E}(L(v)) \\
# =
# & \mathbb{E}(a v+b) \\
# = & a\mathbb{E}(v)+b \\
# = & L(\mathbb{E}(v)) \\
# = & f(\mathbb{E}(v))
# \end{align*}
# $$
#
# Equality holds when $f$ is linear. For a concave function $f$, the sense of
# the inequality is reversed.
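#
# A small numerical illustration: for the convex function $f(v)=e^{v}$ and a
# standard normal random variable $v$ (an arbitrary choice for this sketch),
# the sample version of $\mathbb{E}(f(v))$ should dominate $f(\mathbb{E}(v))$.
# In[ ]:
import numpy as np
rng = np.random.default_rng(0)        # fixed seed, only for reproducibility
v = rng.standard_normal(100000)
lhs = np.exp(v).mean()                # approximates E(exp(v)) = exp(1/2)
rhs = np.exp(v.mean())                # approximates exp(E(v)) = 1
print(lhs >= rhs, lhs, rhs)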
#
# In[ ]: