The aim of this IPython Notebook is to show how we can use Python to build predictive algorithms that solve data science problems in the arena of education.
This notebook is still heavily under construction.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# Get the data: Algebra 2005-2006
train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'
test_filepath = 'data/algebra0506/algebra_2005_2006_test.txt'
traindata = pd.read_table(train_filepath)
More information about the data format can be found on the challenge website.
# Inspect some of the training data
traindata.head()
| | Row | Anon Student Id | Problem Hierarchy | Problem Name | Problem View | Step Name | Step Start Time | First Transaction Time | Correct Transaction Time | Step End Time | Step Duration (sec) | Correct Step Duration (sec) | Error Step Duration (sec) | Correct First Attempt | Incorrects | Hints | Corrects | KC(Default) | Opportunity(Default) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | 3(x+2) = 15 | 2005-09-09 12:24:35.0 | 2005-09-09 12:24:49.0 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:15.0 | 40 | NaN | 40 | 0 | 2 | 3 | 1 | [SkillRule: Eliminate Parens; {CLT nested; CLT... | 1 |
1 | 2 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | x+2 = 5 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 16 | 16 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 1~~1 |
2 | 3 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | 2-8y = -4 | 2005-09-09 12:25:36.0 | 2005-09-09 12:25:43.0 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:12.0 | 36 | NaN | 36 | 0 | 2 | 3 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 2 |
3 | 4 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | -8y = -6 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 22 | 22 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove coefficient; {ax+b=c, divid... | 1~~1 |
4 | 5 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 2 | -7y-5 = -4 | 2005-09-09 12:26:38.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 118 | 118 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 3~~1 |
5 rows × 19 columns
Let's begin asking some basic questions of the data
# Take the column of anonymized student IDs and count the number of unique entries
print('Number of students:', len(np.unique(traindata['Anon Student Id'])))
Number of students: 574
csd = traindata['Correct Step Duration (sec)']
csd.describe()
count    620129.000000
mean         18.071478
std          34.796694
min           0.000000
25%           5.000000
50%           8.000000
75%          18.000000
max        1907.000000
Name: Correct Step Duration (sec), dtype: float64
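So, ignoring all the steps that have no correct-step duration recorded, the mean duration of a problem step was about 18 seconds. The median is only 8 seconds, so the distribution is skewed towards short times with a long tail of slow steps.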
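To make the "ignoring" part concrete: pandas summary statistics skip NaN entries, so `describe()` only sees the rows that have a correct-step duration. The sketch below is a minimal illustration on a made-up five-row frame (the values and the `toy` names are assumptions, not taken from the dataset); it shows the row counts and contrasts the mean with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN marks steps with no correct-step duration
toy = pd.DataFrame({
    'Correct Step Duration (sec)': [16.0, np.nan, 22.0, 118.0, np.nan],
})
toy_csd = toy['Correct Step Duration (sec)']

n_total = len(toy_csd)         # all rows, including NaN
n_counted = toy_csd.count()    # rows that describe() actually uses
toy_mean = toy_csd.mean()      # NaN entries are skipped
toy_median = toy_csd.median()  # less sensitive to the single slow step

print('rows:', n_total, 'counted:', n_counted)
print('mean:', toy_mean, 'median:', toy_median)
```

The real column shows the same kind of gap: a mean of about 18 seconds against a median of 8 seconds.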
Let's histogram this data to see the distribution.
%matplotlib inline
# Normalized histogram of correct-step durations between 0 and 100 seconds (1-second bins)
hist = plt.hist(np.array(csd.dropna()), bins=100, density=True, log=False, range=(0, 100))
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Fraction')
plt.show()
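counts, bins = hist[0], hist[1]
# Multiply the densities by the bin widths so the running sum is a cumulative fraction
cdf = np.cumsum(counts * np.diff(bins))
plt.plot(bins[1:], cdf)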
plt.xlabel('Time to correct answer (sec)')
plt.ylabel('Cumulative fraction')
plt.axis((0,100,0,1.0))
plt.show()
The histogram shows visually what the summary statistics hint at. The distribution is heavily weighted towards steps that are solved in under 20 seconds. Among the steps that took at most 100 seconds (the range of the histogram), the cumulative distribution function (CDF) shows that roughly 80% of successful attempts finish within 20 seconds and about 90% finish within 40 seconds. Very few take longer than 80 seconds.
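The fractions quoted above can also be read off without building a histogram. The sketch below is one possible way to do it, using synthetic durations in place of `csd` (the exponential sample and the helper name `fraction_within` are assumptions for illustration only): the fraction of values at or below a threshold is simply the mean of a boolean mask.

```python
import numpy as np

def fraction_within(durations, threshold):
    """Fraction of non-NaN durations that are <= threshold."""
    values = np.asarray(durations, dtype=float)
    values = values[~np.isnan(values)]  # same effect as Series.dropna()
    return np.mean(values <= threshold)

# Synthetic stand-in for the real column; this is not the challenge data
rng = np.random.RandomState(0)
fake_durations = rng.exponential(scale=15.0, size=10000)

for t in (20, 40, 80):
    print(t, fraction_within(fake_durations, t))
```

Applied to the real column, this would count every recorded duration, including the long tail up to the 1907-second maximum that a histogram restricted to `range=(0,100)` leaves out.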
OK, let's ask a slightly harder question: how are students doing problem by problem? The answer will take several parts.
First, let's get the number of unique problems
# The unique identifier for each problem is the 'Problem Name'
problems = traindata['Problem Name']
# Get just the uniques
problems = np.unique(problems)
print('Number of unique problems:', len(problems))
Number of unique problems: 1084
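# Median time-to-correct for each problem (median() skips NaN durations)
pmedian_times = {}
for p in problems:
    pmedian_times[p] = traindata[traindata['Problem Name'] == p]['Correct Step Duration (sec)'].median()
import operator
# Sort the (problem, median) pairs from the slowest to the fastest median time
sorted_times = sorted(pmedian_times.items(), key=operator.itemgetter(1), reverse=True)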
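The per-problem loop above filters the full table once for every problem. A possible alternative, sketched here on a small made-up frame (the rows, the `toy` names, and the problem name `EG99` are illustrative assumptions, not from the dataset), is to let pandas group the rows and compute every median in one pass. Using `sort_values` also puts problems with no valid duration (NaN median) at the end, whereas `sorted` gives no reliable position to NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the same column names as the training data
toy = pd.DataFrame({
    'Problem Name': ['EG40', 'EG40', 'EG4-FIXED', 'EG4-FIXED', 'EG99'],
    'Correct Step Duration (sec)': [22.0, 118.0, 16.0, np.nan, np.nan],
})

# One median per problem; NaN durations are skipped within each group
toy_medians = toy.groupby('Problem Name')['Correct Step Duration (sec)'].median()

# Slowest problems first; groups with no valid duration go last
toy_sorted = toy_medians.sort_values(ascending=False, na_position='last')
print(toy_sorted)
```

The result is a pandas Series indexed by problem name, so `toy_sorted.head(10)` would list the ten slowest problems directly.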
traindata.columns
Index([u'Row', u'Anon Student Id', u'Problem Hierarchy', u'Problem Name', u'Problem View', u'Step Name', u'Step Start Time', u'First Transaction Time', u'Correct Transaction Time', u'Step End Time', u'Step Duration (sec)', u'Correct Step Duration (sec)', u'Error Step Duration (sec)', u'Correct First Attempt', u'Incorrects', u'Hints', u'Corrects', u'KC(Default)', u'Opportunity(Default)'], dtype='object')
traindata['Step Name']
0                3(x+2) = 15
1                    x+2 = 5
2                  2-8y = -4
3                   -8y = -6
4                 -7y-5 = -4
5                    -7y = 1
6                   7y+4 = 7
7                     7y = 3
8                 -5+9y = -6
9                    9y = -1
10                -7-3x = -2
11            -7-3x+7 = -2+7
12                   -3x = 5
13             -3x/-3 = 5/-3
14                 -9 = 8y+9
...
809679               -4x = 5
809680         -4x/-4 = 5/-4
809681              x = 5/-4
809682            0 = -1y-10
809683      0+10 = -1y-10+10
809684        10 = -1y-10+10
809685              10 = -1y
809686        10/-1 = -1y/-1
809687          -10 = -1y/-1
809688             -7+2x = 4
809689         -7+2x+7 = 4+7
809690              2x = 4+7
809691               2x = 11
809692           2x/2 = 11/2
809693             -2+5x = 8
Name: Step Name, Length: 809694, dtype: object