#!/usr/bin/env python
# coding: utf-8
# In[1]:
from IPython.display import Image
Image('../../Python_probability_statistics_machine_learning_2E.png',width=200)
# # Information Entropy
#
# We are in a position to discuss information entropy.
# This will give us a
# powerful perspective on how information passes between
# experiments, and will
# prove important in certain machine learning algorithms.
# There used to be a TV game show where the host would hide a prize behind
# one of
# three doors and the contestant would have to pick one of the doors.
# However,
# before opening the door of the contestant's choice, the host
# would open one of
# the other doors and ask the contestant if she wanted to
# change her selection.
# This is the classic *Monty Hall* problem. The
# question is should the contestant
# stay with her original choice or switch
# after seeing what the host has revealed?
# From the information theory
# perspective, does the information environment change
# when the host reveals
# what is behind one of the doors? The important detail
# here is that the
# host *never* opens the door with the prize behind it,
# regardless of the
# contestant's choice. That is, the host *knows* where the prize
# is, but he
# does not reveal that information directly to the contestant. This is
# the
# fundamental problem information theory addresses --- how to aggregate and
# reason about partial information. We need a concept of information that
# can
# accommodate this kind of question.
#
# ## Information Theory Concepts
#
# The Shannon
# *information content* of an outcome $x$ is defined as,
#
# $$
# h(x) = \log_2\frac{1}{P(x)}
# $$
#
# where $P(x)$ is the probability of $x$. The *entropy* of the ensemble
# $X$ is
# defined to be the average Shannon information content,
#
# $$
# H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}
# $$
#
# It is no accident that the entropy has this functional form
# as the expectation
# of $h(x)$. It leads to a deep and powerful theory
# of information.
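#
# As a quick numerical check of this definition, here is a minimal sketch: a
# small helper function (`entropy`, introduced here for illustration) that
# computes the entropy of a discrete distribution directly as the expectation
# of $h(x)$.
# In[ ]:
import numpy as np

def entropy(p):
    'Shannon entropy in bits of a discrete distribution p'
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # terms with P(x)=0 contribute nothing
    return np.sum(p * np.log2(1/p))

entropy([0.5, 0.5]), entropy([0.9, 0.1])  # fair coin vs. biased coin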
#
# To get some
# intuition about what information entropy means, consider a sequence
# of three-bit
# numbers where each individual bit is equally likely. Thus, the individual
# information content of a single bit is $h(x) = \log_2 (2) = 1$. The units of
# entropy are *bits*, so this says that the information content of a single bit
# is one bit. Because the bits of the three-bit number are mutually independent
# and equally likely, the information entropy of the three-bit number is
# $H(X) = 2^3 \times \frac{1}{2^3}\log_2(2^3) = 3$ bits. Thus, the basic idea of
# information content at least makes sense at this level.
#
# A better way to interpret this is to ask how much information we would have
# to provide in order to uniquely encode an
# arbitrary three-bit number. In this
# case, you would have to answer three
# questions: *Is the first bit zero or one?
# Is the second bit zero or one? Is the
# third bit zero or one?* Answering
# these questions uniquely specifies the unknown
# three-bit number. Because the
# bits are mutually independent, knowing the state
# of any of the bits does not
# inform the remainder.
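#
# Using the `entropy` helper sketched above, we can confirm that the eight
# equally likely three-bit values carry three bits of entropy.
# In[ ]:
import numpy as np
# eight equally likely three-bit numbers => 3 bits of entropy
entropy(np.ones(8)/8)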
#
# Next, let's consider a
# situation that lacks this mutual independence. Suppose
# in a group of nine
# otherwise identical balls there is a heavier one.
# Furthermore, we also have a
# measuring scale that indicates whether one side is
# heavier, lighter, or equal to
# the other. How could we identify the heavier
# ball? At the outset, the
# information content, which measures the uncertainty of
# the situation, is
# $\log_2(9)$ because any one of the nine balls could be the heavier one.
# [Figure](#fig:Information_Entropy_001) shows
# one strategy. We could arbitrarily
# select out one of the balls (shown by the
# square), leaving the remaining eight
# to be balanced. The thick, black
# horizontal line indicates the scale. The items
# below and above this line
# indicate the counterbalanced sides of the scale.
#
# <a name="fig:Information_Entropy_001"></a>
# Figure: One heavy ball is hidden among eight identical balls. By weighing
# groups sequentially, we can determine the heavy ball.
#
# If we get lucky, the scale will report that
# the groups of four balls on either
# side of the balance are equal in weight. This
# means that the ball that was
# omitted is the heavier one. This is indicated by
# the hashed left-pointing
# arrow. In this case, all the uncertainty has
# evaporated, and the *informational
# value* of that one weighing is equal to
# $\log_2(9)$. In other words, the scale
# has reduced the uncertainty to zero
# (i.e., found the heavy ball). On the other
# hand, the scale could report that the
# upper group of four balls is heavier
# (black, upward-pointing arrow) or lighter
# (gray, downward-pointing arrow). In
# this case, we cannot isolate the heavier
# ball until we perform all of the
# indicated weighings, moving from left-to-right.
# Specifically, the four balls on
# the heavier side have to be split by a
# subsequent weighing into two balls and
# then to one ball before the heavy ball
# can be identified. Thus, this process
# takes three weighings. The first one has
# information content $\log_2(9/8)$, the
# next has $\log_2(4)$, and the final one
# has $\log_2(2)$. Adding all these up
# sums to $\log_2(9)$. Thus, whether or not
# the heavier ball is isolated in the
# first weighing, the strategy consumes
# $\log_2(9)$ bits, as it must, to find the
# heavy ball.
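#
# We can verify this bookkeeping numerically: the information gained by the
# three weighings of the first strategy sums to $\log_2(9)$.
# In[ ]:
import numpy as np
# information content of each weighing in the first strategy
np.log2(9/8) + np.log2(4) + np.log2(2), np.log2(9)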
#
# <a name="fig:Information_Entropy_002"></a>
# Figure: For this strategy, the balls are broken up into three groups of equal
# size and subsequently weighed.
#
# However,
# this is not the only strategy. [Figure](#fig:Information_Entropy_002)
# shows
# another. In this approach, the nine balls are split up into three groups
# of
# three balls apiece. Two groups are weighed. If they are of equal weight,
# then
# this means the heavier ball is in the group that was left out (dashed
# arrow).
# Then, this group is split into two groups, with one element left out.
# If the two
# balls on the scale weigh the same, then it means the excluded one is
# the heavy
# one. Otherwise, it is one of the balls on the scale. The same process
# follows if
# one of the initially weighed groups is heavier (black upward-facing
# arrow) or
# lighter (gray, downward-facing arrow). As before, the information content
# of the
# situation is $\log_2(9)$. The first weighing reduces the uncertainty of
# the
# situation by $\log_2(3)$ and the subsequent weighing reduces it by another
# $\log_2(3)$. As before, these sum to $\log_2(9)$, but here we only need two
# weighings whereas the first strategy in [Figure](#fig:Information_Entropy_001)
# takes an average of $\frac{1}{9}\times 1 + \frac{8}{9}\times 3 = \frac{25}{9} \approx 2.78$
# weighings, which is more than the two weighings required by the second
# strategy in [Figure](#fig:Information_Entropy_002).
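#
# The expected number of weighings for the first strategy follows from the two
# cases: with probability $1/9$ a single weighing suffices, and otherwise three
# weighings are needed.
# In[ ]:
# expected number of weighings for the first strategy
1/9*1 + 8/9*3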
#
# Why does
# the second strategy use fewer weighings? To reduce weighings, we need
# each
# weighing to adjudicate equally probable situations as many times as
# possible.
# Choosing one of the nine balls at the outset (i.e., the first strategy in
# [Figure](#fig:Information_Entropy_001)) does not do this because the
# probability of selecting the correct ball is $1/9$. This does not create an
# equiprobable situation in the process. The second strategy leaves an equally
# probable situation at every stage (see
# [Figure](#fig:Information_Entropy_002)), so it extracts as much information as
# possible out of each weighing. Thus, the
# information content tells us how many bits
# of information have to be resolved
# using *any* strategy (i.e., $\log_2(9)$ in
# this example). It also illuminates
# how to efficiently remove uncertainty;
# namely, by adjudicating equiprobable
# situations as many times as possible.
#
# ## Properties of Information Entropy
# Now that we have the flavor of the concepts, consider the following properties
# of the information entropy,
#
# $$
# H(X) \ge 0
# $$
#
# with equality if and only if $P(x)=1$ for exactly one $x$.
# Intuitively, this
# means that when just one of the items in the ensemble is
# known absolutely (i.e.,
# with $P(x)=1$), the uncertainty collapses to zero.
# Also note that entropy is
# maximized when $P$ is uniformly distributed across
# the elements of the ensemble.
# This is illustrated in [Figure](#fig:Information_Entropy_003) for the case of
# two outcomes. In other words,
# information entropy is maximized when the two
# conflicting alternatives are
# equally probable. This is the mathematical reason
# why using the scale in the
# last example to adjudicate equally probable
# situations was so useful for
# abbreviating the weighing process.
# In[4]:
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots
import numpy as np
# binary entropy H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)); avoid the endpoints
# where the logarithm diverges
p = np.linspace(0.001,1-.001,50)
fig,ax=subplots()
#fig.set_size_inches((14,7))
_=ax.plot(p,p*np.log2(1/p)+(1-p)*np.log2(1/(1-p)),'k-')
_=ax.set_xlabel('$p$',fontsize=24)
_=ax.set_ylabel('$H(p)$',fontsize=24)
_=ax.grid()
fig.tight_layout()
fig.savefig('fig-probability/information_entropy_003.png')
#
# <a name="fig:Information_Entropy_003"></a>
# Figure: The information entropy is maximized when $p=1/2$.
#
# Most importantly, the concept of entropy
# extends jointly as follows,
#
# $$
# H(X,Y) = \sum_{x,y} P(x,y) \log_2 \frac{1}{P(x,y)}
# $$
#
# If and only if $X$ and $Y$ are independent, entropy becomes
# additive,
#
# $$
# H(X,Y) = H(X)+H(Y)
# $$
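#
# A quick numerical sketch of this additivity, reusing the `entropy` helper
# from earlier with two illustrative marginal distributions `px` and `py`: for
# independent $X$ and $Y$, the joint distribution is the outer product of the
# marginals, and the joint entropy equals the sum of the marginal entropies.
# In[ ]:
import numpy as np
px = np.array([0.2, 0.8])
py = np.array([0.1, 0.3, 0.6])
pxy = np.outer(px, py)          # joint distribution under independence
entropy(pxy.ravel()), entropy(px) + entropy(py)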
#
# ## Kullback-Leibler Divergence
#
# Notions of information entropy lead to notions
# of distance between probability
# distributions that will become important for
# machine learning methods. The
# Kullback-Leibler divergence between two
# probability distributions $P$ and $Q$
# that are defined over the same set is
# defined as,
#
# $$
# D_{KL}(P,Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}
# $$
#
# Note that $D_{KL}(P,Q) \ge 0$ with equality if and only if $P=Q$.
# Sometimes the
# Kullback-Leibler divergence is called the Kullback-Leibler
# distance, but it is
# not formally a distance metric because it is asymmetrical
# in $P$ and $Q$. The
# Kullback-Leibler divergence defines a relative entropy as
# the loss of
# information if $P$ is modeled in terms of $Q$. There is an
# intuitive way to
# interpret the Kullback-Leibler divergence and understand its
# lack of symmetry.
# Suppose we have a set of messages to transmit, each with a
# corresponding
# probability $\lbrace
# (x_1,P(x_1)),(x_2,P(x_2)),\ldots,(x_n,P(x_n)) \rbrace$.
# Based on what we know
# about information entropy, it makes sense to encode each message $x_k$ with a
# length of
# $\log_2 \frac{1}{P(x_k)}$ bits. This parsimonious
# strategy means that more
# frequent messages are encoded with fewer bits. Thus, we
# can rewrite the entropy
# of the situation as before,
#
# $$
# H(X) = \sum_{k} P(x_k) \log_2 \frac{1}{P(x_k)}
# $$
#
# Now, suppose we want to transmit the same set of messages, but with a
# different
# set of probability weights, $\lbrace
# (x_1,Q(x_1)),(x_2,Q(x_2)),\ldots,(x_n,Q(x_n)) \rbrace$. In this situation, we
# can define the cross-entropy as
#
# $$
# H_q(X) = \sum_{k} P(x_k) \log_2 \frac{1}{Q(x_k)}
# $$
#
# Note that only the purported length of the encoded message has
# changed, not the
# probability of that message. The difference between these two
# is the Kullback-
# Leibler divergence,
#
# $$
# D_{KL}(P,Q)=H_q(X)-H(X)=\sum_x P(x) \log_2 \frac{P(x)}{Q(x)}
# $$
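#
# The following sketch checks this identity on two small distributions,
# reusing the `entropy` helper from earlier and defining illustrative
# `cross_entropy` and `kl` functions; it also shows the lack of symmetry in
# $P$ and $Q$.
# In[ ]:
import numpy as np

def cross_entropy(p, q):
    'average encoded length when P-distributed messages use Q-optimal codes'
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(1/q))

def kl(p, q):
    'Kullback-Leibler divergence D_KL(P,Q) in bits'
    return cross_entropy(p, q) - entropy(p)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
kl(P, Q), kl(Q, P)   # both nonnegative, but not equal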
#
# In this light, the Kullback-Leibler divergence is the average
# difference in the
# encoded lengths of the same set of messages under two
# different probability
# regimes. This should help explain the lack of symmetry of
# the Kullback-Leibler
# divergence --- left to themselves, $P$ and $Q$ would
# provide the optimal-length
# encodings separately, but there can be no necessary
# symmetry in how each regime
# would rate the informational value of each message
# ($Q(x_i)$ versus $P(x_i)$).
# Because each encoding is optimal-length in its
# own regime, it must
# therefore be at least sub-optimal in the other,
# thus giving rise to the Kullback-
# Leibler divergence. In the case where the
# encoding length of all messages
# remains the same for the two regimes, the
# Kullback-Leibler divergence is
# zero [^Mackay].
#
# [^Mackay]: The best, easy-to-understand presentation of this
# material is chapter
# four of Mackay's text
# [[mackay2003information]](#mackay2003information). Another good reference is
# chapter four of [[hastie2013elements]](#hastie2013elements).
#
#
# ## Cross-Entropy
# as Maximum Likelihood
#
#
#
# Reconsidering maximum
# likelihood from our statistics chapter in more
# general terms, we have
#
# $$
# \theta_{\texttt{ML}} =\argmax_{\theta} \sum_{i=1}^n \log
# p_{\texttt{model}}(x_i;\theta)
# $$
#
# where $p_{\texttt{model}}$ is the assumed underlying probability
# density
# function parameterized by $\theta$ for the $x_i$ data elements.
# Dividing the
# above summation by $n$ does not change the derived optimal values,
# but it allows
# us to rewrite this using the empirical density function for $x$
# as the
# following,
#
# $$
# \theta_{\texttt{ML}} = \argmax_{\theta} \mathbb{E}_{x\sim
# \hat{p}_{\texttt{data}}}(\log p_{\texttt{model}}(x;\theta))
# $$
#
# Note that we have the distinction between $p_{\texttt{data}}$ and
# $\hat{p}_{\texttt{data}}$ where the former is the unknown distribution of
# the data and the latter is the estimated distribution of the data we have on
# hand.
#
# The Kullback-Leibler divergence can be written in terms of expectations as the following,
#
# $$
# D_{KL}(P,Q) = \mathbb{E}_{X\sim P}(\log P(x))-\mathbb{E}_{X\sim P}(\log Q(x))
# $$
#
# where $X\sim P$ means the random variable $X$ has distribution $P$.
# Thus, we
# have
#
# $$
# \theta_{\texttt{ML}} = \argmin_{\theta}
# D_{KL}(\hat{p}_{\texttt{data}}, p_{\texttt{model}})
# $$
#
# That is, we can interpret maximum likelihood as minimizing the
# Kullback-Leibler divergence between the $\hat{p}_{\texttt{data}}$ and
# $p_{\texttt{model}}$ distributions. The first term in the divergence has
# nothing to do with the estimated $\theta$, so minimizing the divergence is
# the same as maximizing the following,
#
# $$
# \mathbb{E}_{x\sim \hat{p}_{\texttt{data}}}(\log
# p_{\texttt{model}}(x;\theta))
# $$
#
# which is the same as minimizing the cross-entropy $H_q$ between the two
# distributions. The important interpretation is that maximum likelihood is an
# attempt to choose model parameters $\theta$ that make the empirical
# distribution of the data match the model distribution.
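#
# As a concrete sketch of this interpretation, consider an assumed coin-flip
# dataset and a Bernoulli model with parameter $\theta$ scanned over a grid
# (the simulated data, the grid, and the reuse of the illustrative `kl` helper
# above are all assumptions for this example). The $\theta$ that maximizes the
# average log-likelihood is the same $\theta$ that minimizes the divergence
# from the empirical distribution, and both sit near the empirical frequency
# of heads.
# In[ ]:
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=500)        # simulated flips with heads probability 0.3
phat = np.array([1-x.mean(), x.mean()])   # empirical distribution over {0,1}

theta = np.linspace(0.01, 0.99, 99)
avg_loglike = np.array([np.mean(x*np.log2(t) + (1-x)*np.log2(1-t))
                        for t in theta])
divergence = np.array([kl(phat, np.array([1-t, t])) for t in theta])

theta[np.argmax(avg_loglike)], theta[np.argmin(divergence)], x.mean()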