This is a fun example. Suppose we wish to predict how many sign-ups there are on Github.com. Officially, Github does not release an up-to-date count, and at last offical annoucment (January 2013) the count was 3 million. What if we wish to measure it today? We could extrapolate future numbers from previous annoucements, but this uses little data and we could potentially be off by hundreds of thousands, and you are essentially just curve fitting complicated models.
Instead, what we are going to use is user id
numbers from real-time feeds on Github. The script github_events.py
will pull the most recent 300 events from the Github Public Timeline feed (we'll be accessing data using their API). From this, we pull out the user ids
associated with each event. We run the script below and display some output:
%run github_events.py
print "Some User ids from the latest events (push, star, fork etc.) on Github."
print ids[:10]
print
print "Number of unique ids found: ", ids.shape[0]
print "Largest user id: ", ids.max()
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) c:\Python27\lib\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc) 169 else: 170 filename = fname --> 171 exec compile(scripttext, filename, 'exec') in glob, loc 172 else: 173 def execfile(fname, *where): C:\Users\Cameron\Dropbox\My Work\Probabilistic-Programming-and-Bayesian-Methods-for-Hackers\Chapter2_MorePyMC\github_events.py in <module>() 22 data = loads(r.text) 23 for event in data: ---> 24 ids[k] = ( event["actor"]["id"] ) 25 k+=1 26 KeyError: 'id'
Some User ids from the latest events (push, star, fork etc.) on Github. [1524995 1978503 1926860 1524995 3707208 374604 37715 770655 502701 4349707] Number of unique ids found: 300 Largest user id: 2085773151
figsize(12.5,3)
plt.hist( ids, bins = 45, alpha = 0.9)
plt.title("Histogram of %d Github User ids"%ids.shape[0] );
plt.xlabel("User id")
plt.ylabel("Frequency");
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-2-e83805e8eaea> in <module>() ----> 1 figsize(12.5,3) 2 plt.hist( ids, bins = 45, alpha = 0.9) 3 plt.title("Histogram of %d Github User ids"%ids.shape[0] ); 4 plt.xlabel("User id") 5 plt.ylabel("Frequency"); NameError: name 'figsize' is not defined
There are some users with multiple events, but we are only interested in unique user ids
, hence why we have less than 300 ids. Above I printed the largest user id
. Why is this important? If Github assigns user ids
serially, which is a fair assumption, then we know that there are certainly more users than that number. Remember, we are only looking at less than 300 individuals out of a much larger population, so it is unlikely that we would have sampled the most recent sign-up.
At best, we can only estimate the total number of sign-ups. Let's get more familar with this problem. Consider a fictional website that we wish to estimate the number of users:
Suppose we sampled only two individuals in a similar manner to above: the ids are 3 and 10 respectively. Would it be likely that the website has millions of users? Not very. Alternatively, it is more likely the website has less than 100 users.
On the other hand, if the ids were 3 and 34 989, we might be more willing to guess there could possibly thousands, or millions of user sign-ups. We are not very confident in an estimate, due to a lack of data.
If we sample thousands of users, and the maximum user id
is still 34 989, then is seems likely that the total number of sign ups is near 35 000. Hence our inference should be more confident.
We make the following assumption:
Assumption: Every user is equally likely to perform an event. Clearly, looking at the above histogram, this assumption is violated. The participation on Github is skewed towards early adopters, likely as these early-adopting individuals have a) more invested in Github, and b) saw the value earlier in Github, therefore are more interested in it. The distribution is also skewed towards new sign ups, who likely signed up just to push a project.
To create a Bayesian model of this is easy. Based on the above assumption, all user_ids
sampled are from a DiscreteUniform
model, with lower bound 1 and an unknown upperbound. We don't have a strong belief about what the upper-bound might be, but we do know it will be larger than ids.max()
.
Working with such large numbers can cause numerical problem, hence we will scale everything by a million. Thus, instead of a DiscreteUniform
, we will used a Uniform
:
FACTOR = 1000000.
import pymc as mc
upper_bound = mc.Uniform( "n_sign_ups", ids.max()/FACTOR, (ids.max())/FACTOR + 1)
obs = mc.Uniform("obs", 0, upper_bound, value = ids/FACTOR, observed = True )
#code to be examplained in Chp. 3.
mcmc = mc.MCMC([upper_bound, obs] )
mcmc.sample( 100000, 45000)
--------------------------------------------------------------------------- ZeroProbability Traceback (most recent call last) <ipython-input-3-4a6014825d95> in <module>() 4 5 upper_bound = mc.Uniform( "n_sign_ups", ids.max()/FACTOR, (ids.max())/FACTOR + 1) ----> 6 obs = mc.Uniform("obs", 0, upper_bound, value = ids/FACTOR, observed = True ) 7 8 #code to be examplained in Chp. 3. c:\Python27\lib\site-packages\pymc\distributions.pyc in __init__(self, *args, **kwds) 268 random = debug_wrapper(random) 269 else: --> 270 Stochastic.__init__(self, logp=logp, random=random, logp_partial_gradients = logp_partial_gradients, dtype=dtype, **arg_dict_out) 271 272 new_class.__name__ = name c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients) 714 if check_logp: 715 # Check initial value --> 716 if not isinstance(self.logp, float): 717 raise ValueError("Stochastic " + self.__name__ + "'s initial log-probability is %s, should be a float." %self.logp.__repr__()) 718 c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in get_logp(self) 846 raise ZeroProbability(self.errmsg + "\nValue: %s\nParents' values:%s" % (self._value, self._parents.value)) 847 else: --> 848 raise ZeroProbability(self.errmsg) 849 850 return logp ZeroProbability: Stochastic obs's value is outside its support, or it forbids its parents' current values.
from scipy.stats.mstats import mquantiles
samples = mcmc.trace("n_sign_ups")[:]
hist(samples, bins = 100,
label = "Uniform prior",
normed=True, alpha = 0.8,
histtype="stepfilled", color = "#7A68A6" );
quantiles_mean = np.append( mquantiles( samples, [0.05, 0.5, 0.95]), samples.mean() )
print "Quantiles: ", quantiles_mean[:3]
print "Mean: ", quantiles_mean[-1]
plt.vlines( quantiles_mean, 0, 33,
linewidth=2, linestyles = ["--", "--", "--", "-"],
)
plt.title("Posterior distribution of total number of Github users" )
plt.xlabel("number of users (in millions)")
plt.legend()
plt.xlim( ids.max()/FACTOR - 0.01, ids.max()/FACTOR + 0.12 );
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-4-2ca0106ec6de> in <module>() 1 from scipy.stats.mstats import mquantiles 2 ----> 3 samples = mcmc.trace("n_sign_ups")[:] 4 5 hist(samples, bins = 100, NameError: name 'mcmc' is not defined
Above we have plotted the posterior distribution. Note that there is no posterior probability assigned to the number of users being less than ids.max()
. That is good, as it would be an impossible situation.
The three dashed vertical bars, from left to right, are the 5%, 50% and 95% quantitle lines. That is, 5% of the probability is before the first line, 50% before the second and 95% before the third. The 50% quantitle is also know as the median and is a better measure of centrality than the mean for heavily skewed distributions like this one. The solid line is the posterior distribution's mean.
So what can we say? Using the data above, there is a 95% chance that there are less than 4.4 million users, and is probably around 4.36 million users. I was wondering how accurate this figure was. At the time of this writing, it seems a bit high considering only five months prior the number was at 3 million:
Last night @github crossed the 3M user mark #turntup
— Rick Bradley (@rickbradley) January 15, 2013
I thought perhaps the user_id
parameter was being used liberally to users/bots/changed names etc, so I contacted Github Support about it:
@cmrn_dp User IDs are assigned to new users/organizations, whether they’re controlled by humans, groups, or bots.
— GitHub Support (@GitHubHelp) May 6, 2013
So we may be overestimating by including organizations, which perhaps should not be counted as users. TODO: estimate the number of organizations. Any takers?