In this notebook we'll work through some examples of predicting the probability of a given outcome with logistic regression, using daru and statsample-glm.
require 'daru'
require 'statsample-glm'
require 'open-uri'
true
For this notebook, we will use a dataset recording whether students were admitted to a graduate degree program, based on their GRE score, GPA and the rank of the institution where they completed their undergraduate degree (ranked from 1 to 4).
It should be noted that statsample-glm does not yet support categorical data, so the ranks will be treated as continuous.
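If a categorical treatment of rank is ever needed, one workaround is to expand it by hand into dummy (one-hot) columns before fitting. A hedged sketch in plain Ruby, not part of the statsample-glm API — the `rank_2`/`rank_3`/`rank_4` names are made up for illustration, with rank 1 as the reference level:

```ruby
# Expand a categorical column into 0/1 dummy columns by hand.
# Rank 1 is left out as the reference level.
ranks = [3, 3, 1, 4, 4, 2]   # a few sample values in the style of the dataset

dummies = (2..4).map do |r|
  [:"rank_#{r}", ranks.map { |v| v == r ? 1 : 0 }]
end.to_h

# dummies[:rank_3] => [1, 1, 0, 0, 0, 0]
```

Each resulting array could then be added to the DataFrame as its own predictor column.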
content = open('http://www.ats.ucla.edu/stat/data/binary.csv')
File.write('binary.csv', content.read)
df = Daru::DataFrame.from_csv "binary.csv"
df.vectors = Daru::Index.new([:admit, :gpa, :gre, :rank])
df
Daru::DataFrame:27633020 rows: 400 cols: 4

 | admit | gpa | gre | rank |
---|---|---|---|---|
0 | 0 | 3.61 | 380 | 3 |
1 | 1 | 3.67 | 660 | 3 |
2 | 1 | 4 | 800 | 1 |
3 | 1 | 3.19 | 640 | 4 |
4 | 0 | 2.93 | 520 | 4 |
5 | 1 | 3 | 760 | 2 |
6 | 1 | 2.98 | 560 | 1 |
7 | 0 | 3.08 | 400 | 2 |
8 | 1 | 3.39 | 540 | 3 |
9 | 0 | 3.92 | 700 | 2 |
10 | 0 | 4 | 800 | 4 |
11 | 0 | 3.22 | 440 | 1 |
12 | 1 | 4 | 760 | 1 |
13 | 0 | 3.08 | 700 | 2 |
14 | 1 | 4 | 700 | 1 |
15 | 0 | 3.44 | 480 | 3 |
16 | 0 | 3.87 | 780 | 4 |
17 | 0 | 2.56 | 360 | 3 |
18 | 0 | 3.75 | 800 | 2 |
19 | 1 | 3.81 | 540 | 1 |
20 | 0 | 3.17 | 500 | 3 |
21 | 1 | 3.63 | 660 | 2 |
22 | 0 | 2.82 | 600 | 4 |
23 | 0 | 3.19 | 680 | 4 |
24 | 1 | 3.35 | 760 | 2 |
25 | 1 | 3.66 | 800 | 1 |
26 | 1 | 3.61 | 620 | 1 |
27 | 1 | 3.74 | 520 | 4 |
28 | 1 | 3.22 | 780 | 2 |
29 | 0 | 3.29 | 520 | 1 |
30 | 0 | 3.78 | 540 | 4 |
31 | 0 | 3.35 | 760 | 3 |
... | ... | ... | ... | ... |
399 | 0 | 3.89 | 600 | 3 |
Use the Statsample::GLM.compute method for logistic regression analysis.
The first argument to compute is the DataFrame object, followed by the name of the Vector that is to be the dependent variable, and then the method determining the link function. This can be :logistic, :probit, :poisson or :normal.
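For intuition, the :logistic option corresponds to the sigmoid (inverse-logit) function, which maps any real-valued linear predictor onto a probability in (0, 1). A minimal plain-Ruby sketch:

```ruby
# The logistic link: probability = 1 / (1 + e^(-z)),
# where z is the linear predictor (log odds).
sigmoid = ->(z) { 1.0 / (1.0 + Math.exp(-z)) }

sigmoid.call(0)    # 0.5   -- zero log odds means even chances
sigmoid.call(4)    # ~0.982
sigmoid.call(-4)   # ~0.018
```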
The coefficients method calculates the coefficients of the GLM; passing :hash returns them as a Hash.
glm = Statsample::GLM.compute df, :admit, :logistic, constant: 1
c = glm.coefficients :hash
{:gpa=>0.777013573719857, :gre=>0.0022939595044433273, :rank=>-0.5600313868499897, :constant=>-3.4495483976684773}
The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
Interpreting the coefficients above: a one-unit increase in GPA raises the log odds of admission by about 0.78, each additional GRE point raises them by about 0.002, and each step down in institution rating (rank increasing by 1) lowers them by about 0.56.
Log odds are difficult to interpret directly, so we'll exponentiate each coefficient so that it can be read as an odds ratio.
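The "change in log odds" reading can be checked directly from the fitted coefficients. A minimal plain-Ruby sketch, with the coefficient values copied from the Hash printed above — two applicants who differ by exactly one GPA point differ in log odds by exactly the :gpa coefficient, whatever the other predictors are:

```ruby
# Coefficients copied from the fitted model above.
c = { gpa: 0.777013573719857, gre: 0.0022939595044433273,
      rank: -0.5600313868499897, constant: -3.4495483976684773 }

# Linear predictor = log odds of admission.
log_odds = lambda do |gpa, gre, rank|
  c[:constant] + gpa * c[:gpa] + gre * c[:gre] + rank * c[:rank]
end

# One-unit GPA difference, everything else held fixed.
delta = log_odds.call(3.5, 600, 2) - log_odds.call(2.5, 600, 2)
# delta equals c[:gpa] (0.777...), up to floating-point rounding
```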
Daru::Vector.new(c).exp # Calling `#exp` on Daru::Vector exponentiates each element of the Vector.
Daru::Vector:17552980 size: 4

 | nil |
---|---|
gpa | 2.174967177712439 |
gre | 1.0022965926425997 |
rank | 0.571191135676971 |
constant | 0.03175997601913591 |
We can now compute the probability of gaining admission into a graduate college based on the rank of the undergraduate college, holding the GRE score and GPA constant at their means.
As the result below shows, the rankp Vector gives the probability of admission for each rank: a student from the most highly rated undergraduate school (rank 1) has a probability of about 0.49 of being admitted to graduate school.
e = Math::E
new_data = Daru::DataFrame.new({
gre: [df[:gre].mean]*4,
gpa: [df[:gpa].mean]*4,
rank: df[:rank].factors
})
new_data[:rankp] = new_data.collect(:row) do |x|
1 / (1 + e ** -(c[:constant] + x[:gre] * c[:gre] + x[:gpa] * c[:gpa] + x[:rank] * c[:rank]))
end
new_data.sort! [:rank]
Daru::DataFrame:16947240 rows: 4 cols: 4

 | gpa | gre | rank | rankp |
---|---|---|---|---|
1 | 3.3899000000000017 | 587.7 | 1 | 0.4931450619837156 |
3 | 3.3899000000000017 | 587.7 | 2 | 0.357219500353945 |
0 | 3.3899000000000017 | 587.7 | 3 | 0.240948896129993 |
2 | 3.3899000000000017 | 587.7 | 4 | 0.1534862275970381 |
To demonstrate with another example, let's create a hypothetical dataset consisting of the body weights of 20 people and whether or not they survived.
For this example we will simply assume that people with lower body weight have a lower chance of survival.
require 'distribution'
# Create a normally distributed Vector with mean 30 and standard deviation 2
rng = Distribution::Normal.rng(30,2)
body_weight = Daru::Vector.new(20.times.map { rng.call }.sort)
# Populate chances of survival, assume that people with less body weight on average
# are less likely to survive.
survive = Daru::Vector.new [0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1]
df = Daru::DataFrame.new({
body_weight: body_weight,
survive: survive
})
Daru::DataFrame:27686700 rows: 20 cols: 2

 | body_weight | survive |
---|---|---|
0 | 27.044364650712417 | 0 |
1 | 27.47159312927552 | 0 |
2 | 27.898166592007826 | 0 |
3 | 28.124290182202703 | 0 |
4 | 28.130559736750975 | 0 |
5 | 28.75865262632871 | 1 |
6 | 28.893891170932104 | 0 |
7 | 29.379537173488142 | 1 |
8 | 29.387455746614265 | 0 |
9 | 29.73011654672403 | 0 |
10 | 29.732865590281754 | 1 |
11 | 29.804600640385086 | 1 |
12 | 30.854286396908076 | 0 |
13 | 31.10670541344917 | 1 |
14 | 31.466802603305748 | 1 |
15 | 31.520641425410044 | 1 |
16 | 31.933197567214524 | 0 |
17 | 32.11397962791281 | 1 |
18 | 32.760606649719776 | 1 |
19 | 34.33739385108647 | 1 |
Compute the logistic regression coefficients.
glm = Statsample::GLM.compute df, :survive, :logistic, constant: 1
coeffs = glm.coefficients :hash
{:body_weight=>0.8433486251123171, :constant=>-25.24920458377614}
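As a quick interpretation check in plain Ruby (the value is copied from the Hash above): each additional unit of body weight multiplies the odds of survival by e raised to the :body_weight coefficient.

```ruby
# Odds ratio for a one-unit increase in body weight.
odds_multiplier = Math.exp(0.8433486251123171)
# ~2.32: each extra unit of body weight roughly
# doubles the odds of survival in this toy model
```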
Based on the coefficients, we compute the predicted probability for each value in the :body_weight Vector and store the results in another Vector called :survive_pred.
e = Math::E
df[:survive_pred] = df[:body_weight].map { |x| 1 / (1 + e ** -(coeffs[:constant] + x*coeffs[:body_weight])) }
df
Daru::DataFrame:27686700 rows: 20 cols: 3

 | body_weight | survive | survive_pred |
---|---|---|---|
0 | 27.044364650712417 | 0 | 0.08007143558819431 |
1 | 27.47159312927552 | 0 | 0.11094995452363857 |
2 | 27.898166592007826 | 0 | 0.15170068399992506 |
3 | 28.124290182202703 | 0 | 0.17790253325703076 |
4 | 28.130559736750975 | 0 | 0.1786771529208482 |
5 | 28.75865262632871 | 1 | 0.26980060957631496 |
6 | 28.893891170932104 | 0 | 0.2928502245475736 |
7 | 29.379537173488142 | 1 | 0.38414006941637974 |
8 | 29.387455746614265 | 0 | 0.3857211724501716 |
9 | 29.73011654672403 | 0 | 0.456025989208083 |
10 | 29.732865590281754 | 1 | 0.4566011649897577 |
11 | 29.804600640385086 | 1 | 0.4716465476624143 |
12 | 30.854286396908076 | 0 | 0.6838918583579029 |
13 | 31.10670541344917 | 1 | 0.7280185490554567 |
14 | 31.466802603305748 | 1 | 0.7838559408058121 |
15 | 31.520641425410044 | 1 | 0.7914495278564925 |
16 | 31.933197567214524 | 0 | 0.843118090723654 |
17 | 32.11397962791281 | 1 | 0.8622465766953867 |
18 | 32.760606649719776 | 1 | 0.9152435218371247 |
19 | 34.33739385108647 | 1 | 0.9760883965278441 |
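A small sanity check, using the coefficients printed earlier: the predicted probability crosses 0.5 exactly where the linear predictor is zero, i.e. at body_weight = -constant / slope.

```ruby
# Body weight at which predicted survival probability is exactly 0.5.
threshold = 25.24920458377614 / 0.8433486251123171
# ~29.94 -- consistent with the table above, where rows below this
# weight have survive_pred < 0.5 and rows above it have > 0.5
```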
The above results can then be plotted using the plot method.
The curve looks like an ideal logistic regression curve.
df.plot type: [:scatter,:line], x: [:body_weight]*2, y: [:survive_pred]*2 do |plot, diagram|
plot.x_label "Body Weight"
plot.y_label "Probability of Survival"
end