From the task description:
The competition task will be to develop a learning model based on the challenge and/or development data sets, use this algorithm to learn from the training portion of the challenge data sets, and then accurately predict student performance in the test sections.
Some of the technical challenges of this problem include:
The data matrix is sparse: not all students are given every problem, and some problems have only 1 or 2 students who completed each item. So, the contestants need to exploit relationships among problems to bring to bear enough data to hope to learn.
There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions in other items. So, contestants must pay attention to temporal relationships as well as conceptual relationships among items.
Which problems a given student sees is determined in part by student choices or past success history: e.g., students only see remedial problems if they are having trouble with the non-remedial problems. So, contestants need to pay attention to causal relationships in order to avoid selection bias.
I am not going to concern myself too much with the last aspect for now. The interactive tutoring system that students use suggests remedial problems based on mistakes, so some students may end up seeing more of a certain kind of problem. This could skew estimates of a student's overall competency if it is not taken into account, but it will be addressed later.
The first step in solving this problem is establishing the relationships between problems. To predict how well a student will perform on a new problem, we must first relate that problem to the other problems in the database, and then to the problems the student has already encountered.
To establish the relationships between problems, we must use some kind of unsupervised machine learning technique.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Get the data: Algebra 2005-2006
train_filepath = 'data/algebra0506/algebra_2005_2006_train.txt'
test_filepath = 'data/algebra0506/algebra_2005_2006_test.txt'
traindata = pd.read_table(train_filepath)
# What does the training data look like?
traindata.head()
| | Row | Anon Student Id | Problem Hierarchy | Problem Name | Problem View | Step Name | Step Start Time | First Transaction Time | Correct Transaction Time | Step End Time | Step Duration (sec) | Correct Step Duration (sec) | Error Step Duration (sec) | Correct First Attempt | Incorrects | Hints | Corrects | KC(Default) | Opportunity(Default) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | 3(x+2) = 15 | 2005-09-09 12:24:35.0 | 2005-09-09 12:24:49.0 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:15.0 | 40 | NaN | 40 | 0 | 2 | 3 | 1 | [SkillRule: Eliminate Parens; {CLT nested; CLT... | 1 |
| 1 | 2 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG4-FIXED | 1 | x+2 = 5 | 2005-09-09 12:25:15.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 2005-09-09 12:25:31.0 | 16 | 16 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 1~~1 |
| 2 | 3 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | 2-8y = -4 | 2005-09-09 12:25:36.0 | 2005-09-09 12:25:43.0 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:12.0 | 36 | NaN | 36 | 0 | 2 | 3 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 2 |
| 3 | 4 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 1 | -8y = -6 | 2005-09-09 12:26:12.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 2005-09-09 12:26:34.0 | 22 | 22 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove coefficient; {ax+b=c, divid... | 1~~1 |
| 4 | 5 | 0BrbPbwCMz | Unit ES_04, Section ES_04-1 | EG40 | 2 | -7y-5 = -4 | 2005-09-09 12:26:38.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 2005-09-09 12:28:36.0 | 118 | 118 | NaN | 1 | 0 | 0 | 1 | [SkillRule: Remove constant; {ax+b=c, positive... | 3~~1 |
5 rows × 19 columns
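Before digging into the columns, it is worth a quick look at the sparsity mentioned in the first technical challenge: how many of the possible student/step combinations actually appear in the training data. The check below is only a rough sketch that treats each distinct problem/step pair as one item.
# Rough sparsity check: observed (student, step) combinations versus the full grid
n_students = traindata['Anon Student Id'].nunique()
n_steps = traindata.groupby(['Problem Name', 'Step Name']).ngroups
observed = len(traindata.drop_duplicates(['Anon Student Id', 'Problem Name', 'Step Name']))
print 'Fraction of the student-by-step matrix observed: ', float(observed) / (n_students * n_steps)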
# Let's look at the columns
traindata.columns
Index([u'Row', u'Anon Student Id', u'Problem Hierarchy', u'Problem Name', u'Problem View', u'Step Name', u'Step Start Time', u'First Transaction Time', u'Correct Transaction Time', u'Step End Time', u'Step Duration (sec)', u'Correct Step Duration (sec)', u'Error Step Duration (sec)', u'Correct First Attempt', u'Incorrects', u'Hints', u'Corrects', u'KC(Default)', u'Opportunity(Default)'], dtype='object')
So the columns in the dataset pertaining to problem characteristics are 'KC(Default)' (the knowledge components exercised by a step), 'Problem Name', 'Problem Hierarchy', and 'Step Name'. One complication is that a 'Step Name' is unique only within a problem; the same step name can appear in different problems. Taken together with 'Problem Name', however, it identifies a step uniquely.
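One simple way to get a globally unique step identifier is to concatenate the two columns. This is only a sketch; the new column name 'Problem Step' and the ' :: ' separator are my own choices, not part of the dataset.
# Combine 'Problem Name' and 'Step Name' so that steps from different problems cannot collide
traindata['Problem Step'] = traindata['Problem Name'] + ' :: ' + traindata['Step Name']
print 'Unique step names alone:   ', traindata['Step Name'].nunique()
print 'Unique problem/step pairs: ', traindata['Problem Step'].nunique()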
There may also be a need to generate more features if these prove insufficient for the clustering analysis. Other features might include an estimate of relative difficulty based on how often students solve a problem correctly on the first attempt, how many hints they request, and how long they take to solve it.
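As a rough sketch of such difficulty features, one could average outcomes over every student who attempted a step. This builds on the hypothetical 'Problem Step' column defined above, and the feature names are arbitrary labels of my own.
# Rough per-step difficulty features, averaged over all students who attempted the step
grouped = traindata.groupby('Problem Step')
step_features = pd.DataFrame({
    'success_rate': grouped['Correct First Attempt'].mean(),  # lower means harder
    'avg_incorrects': grouped['Incorrects'].mean(),           # average number of wrong attempts
    'avg_hints': grouped['Hints'].mean(),                      # average number of hints requested
    'avg_duration': grouped['Step Duration (sec)'].mean(),     # average time spent (sec)
})
step_features.head()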
Perhaps the first thing to do is to establish a dictionary of knowledge components.
# Create empty list
KCs = []
# Grab the column of Knowledge Components, dropping all NaNs
KCcol = traindata['KC(Default)']
KCcol = list(KCcol.dropna())
# Loop over every database entry, read the skills, split on '~~' separator, and append to list
for i in range(len(KCcol)):
    skills = KCcol[i].split('~~')
    for skill in skills:
        KCs.append(skill)
# Convert to set, which keeps only unique entries, then convert back to list
KCs = list(set(KCs))
# Print length
print 'The total number of unique skills is: ',len(KCs)
The total number of unique skills is: 112
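With the list of unique skills in hand, one possible next step is to encode each step as a binary vector over these skills and feed that matrix to an off-the-shelf clustering algorithm. The sketch below uses scikit-learn's KMeans and the hypothetical 'Problem Step' identifier from earlier; the choice of 20 clusters is an arbitrary guess, not something established by the data yet.
from sklearn.cluster import KMeans
# One row per unique step, one column per unique skill; a 1 marks that the step exercises that skill
kc_rows = traindata[['Problem Step', 'KC(Default)']].dropna().drop_duplicates('Problem Step')
skill_index = {skill: j for j, skill in enumerate(KCs)}
kc_matrix = np.zeros((len(kc_rows), len(KCs)))
for i, kc_string in enumerate(kc_rows['KC(Default)']):
    for skill in kc_string.split('~~'):
        kc_matrix[i, skill_index[skill]] = 1
# Cluster steps that exercise similar skill sets; n_clusters=20 is a placeholder value
step_clusters = KMeans(n_clusters=20, random_state=0).fit_predict(kc_matrix)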