From Kaggle
In order to achieve this we have created a simulated data set with 200 variables and 20,000 cases. An ‘equation’ based on this data was created in order to generate a Target to be predicted. Given the all 20,000 cases, the problem is very easy to solve – but you only get given the Target value of 250 cases – the task is to build a model that gives the best predictions on the remaining 19,750 cases.
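To see why 250 labelled cases against 200 variables invites overfitting, here is a rough sketch on simulated data (written for Python 3). The target equation below is hypothetical, since the real competition equation is not given here, and ordinary least squares is just a convenient stand-in model.

```python
import numpy as np

rng = np.random.RandomState(0)
n_cases, n_vars, n_train = 20000, 200, 250

# Uniform predictors, matching the description of the simulated data set.
X = rng.uniform(0, 1, size=(n_cases, n_vars))

# HYPOTHETICAL target equation: only the first 10 variables matter.
# The real competition equation is not known here.
weights = np.zeros(n_vars)
weights[:10] = rng.uniform(0.5, 1.5, size=10)
score = X.dot(weights)
y = (score > np.median(score)).astype(float)

# Fit ordinary least squares (with intercept) on the 250 labelled cases only.
A = np.column_stack([np.ones(n_cases), X])
beta = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)[0]
pred = (A.dot(beta) > 0.5).astype(float)

train_acc = (pred[:n_train] == y[:n_train]).mean()
test_acc = (pred[n_train:] == y[n_train:]).mean()
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

With 201 free parameters and only 250 rows, the fit can nearly memorise the training labels, so the training accuracy should overstate how well the model does on the other 19,750 cases.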
import gzip
import requests
import zipfile
url = ""
results = requests.get(url)
import StringIO
z = zipfile.ZipFile(StringIO.StringIO(results.content))
d = z.extract('overfitting.csv')  # unpack the CSV to the working directory; returns its path
import numpy as np
data = np.loadtxt("overfitting.csv", delimiter=",", skiprows=1)
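On Python 3 the `StringIO` module no longer exists, and `io.BytesIO` plays the same role for the downloaded bytes. A small sketch of reading a CSV straight out of an in-memory zip archive follows. The archive, the member name `toy.csv` and its columns are invented for illustration; in the notebook the bytes would come from `results.content`.

```python
import io
import zipfile
import numpy as np

# Build a tiny stand-in archive in memory (made-up member name and columns).
csv_text = "case_id,train,x1\n1,1,0.25\n2,0,0.75\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("toy.csv", csv_text)

# Read the member back without writing anything to disk.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    with archive.open("toy.csv") as member:
        text = io.TextIOWrapper(member, encoding="utf-8")
        # skiprows=1 skips the header row of column names
        toy = np.loadtxt(text, delimiter=",", skiprows=1)

print(toy.shape)
```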
print """
There are also 5 other fields,
case_id - 1 to 20,000, a unique identifier for each row
train - 1/0, this is a flag for the first 250 rows which are the training dataset
Target_Practice - we have provided all 20,000 Targets for this model, so you can develop your method completely off line.
Target_Leaderboard - only 250 Targets are provided. You submit your predictions for the remaining 19,750 to the Kaggle leaderboard.
Target_Evaluate - again only 250 Targets are provided. Those competitors who beat the 'benchmark' on the Leaderboard will be asked to make one further submission for the Evaluation model.
"""
data.shape
(20000L, 205L)
ix_training = data[:,1] == 1
ix_testing = data[:,1] == 0
training_data = data[ ix_training, 5: ]
testing_data = data[ ix_testing, 5: ]
training_labels = data[ ix_training, 2]
testing_labels = data[ ix_testing, 2]
print "training:", training_data.shape, training_labels.shape
print "testing: ", testing_data.shape, testing_labels.shape
training: (250L, 200L) (250L,)
testing:  (19750L, 200L) (19750L,)
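The same boolean-mask split can be tried on a toy array. The column order used here (case_id, train, Target_Practice, Target_Leaderboard, Target_Evaluate, then the variables) is an assumption read off the indexing above, and the toy values are made up.

```python
import numpy as np

# Toy stand-in for `data`: 6 cases, 5 bookkeeping columns, 3 predictors.
rng = np.random.RandomState(1)
n_cases, n_meta, n_vars = 6, 5, 3
toy = np.zeros((n_cases, n_meta + n_vars))
toy[:, 0] = np.arange(1, n_cases + 1)         # case_id
toy[:2, 1] = 1                                # train flag on the first 2 rows
toy[:, 2] = rng.randint(0, 2, size=n_cases)   # Target_Practice (assumed column)
toy[:, n_meta:] = rng.uniform(size=(n_cases, n_vars))

# One mask and its negation cover every row exactly once.
is_train = toy[:, 1] == 1
X_train, y_train = toy[is_train, n_meta:], toy[is_train, 2]
X_test, y_test = toy[~is_train, n_meta:], toy[~is_train, 2]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Negating a single mask instead of building two separate comparisons guarantees that no row lands in both sets or in neither.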
He mentions that the X variables are from a uniform distribution. Let's investigate this:
figsize( 12, 4 )
hist( training_data.flatten() )
print training_data.shape[0]*training_data.shape[1]
Looks about right: the histogram is roughly flat, as expected for uniform draws.
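A number to go with the eyeball check: bin the values and see how far the fullest or emptiest bin strays from a flat count. The sketch uses simulated draws as a stand-in for `training_data.flatten()`, and the range of 0 to 1 is an assumption about the data.

```python
import numpy as np

# Stand-in for training_data.flatten(): 250 * 200 uniform draws.
rng = np.random.RandomState(2)
values = rng.uniform(0, 1, size=250 * 200)

counts, edges = np.histogram(values, bins=10, range=(0, 1))
expected = values.size / 10.0

# Largest relative deviation of any bin from the flat expectation.
max_rel_dev = np.abs(counts - expected).max() / expected
print(counts)
print("max relative deviation:", max_rel_dev)
```

For truly uniform data with 5,000 expected values per bin, deviations of a few percent are ordinary sampling noise; much larger ones would suggest a different distribution.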
import pymc as mc
to_include = mc.Bernoulli( "to_include", 0.5, size= 200 )
coef = mc.Uniform( "coefs", 0, 1, size = 200 )
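@mc.deterministic
def Z( coef = coef, to_include = to_include, data = training_data ):
    # linear score over the switched-on variables, centred at zero
    ym = np.dot( to_include*data, coef )
    return ym - ym.mean()

@mc.deterministic
def T( z = Z ):
    # a non-zero score maps to a label probability of 0.045 or 0.945
    return 0.45*(np.sign(z) + 1.1)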
obs = mc.Bernoulli( "obs", T, value = training_labels, observed = True)
model = mc.Model( [to_include, coef, Z, T, obs] )
map_ = mc.MAP( model )
Warning: Stochastic to_include's value is neither numerical nor array with floating-point dtype. Recommend fitting method fmin (default).
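To make the model concrete, the same arithmetic can be written in plain NumPy, outside PyMC. `T` turns the sign of the centred score into a probability of 0.045 or 0.945, so the model tolerates roughly 5% label noise. The helper names and toy inputs below are made up for illustration.

```python
import numpy as np

def label_probability(X, coef, to_include):
    """P(label = 1) for each row, following the Z and T nodes above."""
    z = np.dot(to_include * X, coef)
    z = z - z.mean()
    return 0.45 * (np.sign(z) + 1.1)

def bernoulli_loglike(labels, p):
    """Sum of Bernoulli log-probabilities of the observed labels."""
    return np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Toy inputs (made up): 4 cases, 3 variables, the third one switched off.
X = np.array([[0.1, 0.2, 0.9],
              [0.8, 0.9, 0.1],
              [0.2, 0.1, 0.5],
              [0.9, 0.7, 0.3]])
coef = np.array([1.0, 1.0, 1.0])
to_include = np.array([1, 1, 0])

p = label_probability(X, coef, to_include)
good = bernoulli_loglike(np.array([0, 1, 0, 1]), p)
bad = bernoulli_loglike(np.array([1, 0, 1, 0]), p)
print(p)
print(good, bad)
```

Labels that agree with the sign of the score get a much higher log-likelihood than flipped ones, which is what the sampler exploits when it proposes new values for `to_include` and `coefs`.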
mcmc = mc.MCMC( model )
mcmc.sample(100000, 90000,1)
[****************100%******************] 100000 of 100000 complete
(np.round(T.value) == training_labels ).mean()
t_trace = mcmc.trace("T")[:]
(np.round( t_trace[-500:-400,:]).mean(axis=0) == training_labels ).mean()
t_mean = np.round( t_trace).mean(axis=1)
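One thing to keep in mind: comparing a mean of rounded samples directly with the 0/1 labels only counts a case as correct when every sample in the window agrees. A sketch of two alternatives, per-sample accuracy and a majority vote across samples, on a made-up trace:

```python
import numpy as np

# Made-up trace of label probabilities: 5 posterior samples x 4 cases.
toy_trace = np.array([[0.945, 0.045, 0.945, 0.045],
                      [0.945, 0.045, 0.045, 0.045],
                      [0.945, 0.045, 0.945, 0.945],
                      [0.945, 0.945, 0.945, 0.045],
                      [0.945, 0.045, 0.945, 0.045]])
toy_labels = np.array([1, 0, 1, 0])

hard = np.round(toy_trace)                      # 0/1 prediction per sample
per_sample_acc = (hard == toy_labels).mean(axis=1)

# Majority vote across samples, then compare once with the labels.
vote = (hard.mean(axis=0) > 0.5).astype(int)
vote_acc = (vote == toy_labels).mean()

print(per_sample_acc)
print(vote_acc)
```

Here three of the five samples each get one case wrong, yet the vote recovers every label because the mistakes fall on different cases.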
imshow(t_trace[-10000:,:], aspect="auto")
colorbar()
figsize( 23, 8)
coef_trace = mcmc.trace("coefs")[:]
imshow(coef_trace[-10000:,:], aspect="auto", interpolation="none")
include_trace = mcmc.trace("to_include")[:]
figsize( 23, 8)
imshow(include_trace[-10000:,:], aspect="auto", interpolation="none")
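The image of the inclusion trace can be boiled down to one number per variable: the fraction of samples in which it was switched on. A sketch on a made-up 0/1 trace (the real `include_trace` would have 200 columns):

```python
import numpy as np

# Made-up 0/1 inclusion trace: 6 posterior samples x 4 variables.
toy_include = np.array([[1, 0, 1, 0],
                        [1, 0, 0, 0],
                        [1, 1, 1, 0],
                        [1, 0, 1, 0],
                        [1, 0, 1, 1],
                        [1, 0, 0, 0]])

# Fraction of samples in which each variable was switched on.
inclusion_prob = toy_include.mean(axis=0)

# Variables ordered from most to least frequently included.
ranking = np.argsort(-inclusion_prob, kind="stable")
print(inclusion_prob)
print(ranking)
```

Columns that stay near 0.5, the prior set on `to_include`, are ones the data says little about; columns near 0 or 1 are where the sampler has formed an opinion.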