In this notebook we'll work through some examples of predicting the probability of a given outcome with logistic regression, using daru and statsample-glm.
require 'daru'
require 'statsample-glm'
require 'open-uri'
true
For this notebook, we will use a dataset recording whether students were admitted to a graduate degree program, based on their GRE score, GPA and the rank of the institution where they completed their undergraduate degree (ranked from 1 to 4).
It should be noted that statsample-glm does not yet support categorical data, so the ranks will be treated as continuous.
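If a categorical treatment of rank is ever needed, one workaround is to expand it by hand into dummy (one-hot) columns before fitting. A hedged sketch in plain Ruby, not part of the statsample-glm API — the `rank_2`/`rank_3`/`rank_4` names are made up for illustration, with rank 1 as the reference level:

```ruby
# Expand a categorical column into 0/1 dummy columns by hand.
# Rank 1 is left out as the reference level.
ranks = [3, 3, 1, 4, 4, 2]   # a few sample values in the style of the dataset

dummies = (2..4).map do |r|
  [:"rank_#{r}", ranks.map { |v| v == r ? 1 : 0 }]
end.to_h

# dummies[:rank_3] => [1, 1, 0, 0, 0, 0]
```

Each resulting array could then be added to the DataFrame as its own predictor column.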
content = open('http://www.ats.ucla.edu/stat/data/binary.csv')
File.write('binary.csv', content.read)
df = Daru::DataFrame.from_csv "binary.csv"
df.vectors = Daru::Index.new([:admit, :gpa, :gre, :rank])
df
Daru::DataFrame:27633020 rows: 400 cols: 4

 | admit | gpa | gre | rank |
---|---|---|---|---|
0 | 0 | 3.61 | 380 | 3 |
1 | 1 | 3.67 | 660 | 3 |
2 | 1 | 4 | 800 | 1 |
3 | 1 | 3.19 | 640 | 4 |
4 | 0 | 2.93 | 520 | 4 |
5 | 1 | 3 | 760 | 2 |
6 | 1 | 2.98 | 560 | 1 |
7 | 0 | 3.08 | 400 | 2 |
8 | 1 | 3.39 | 540 | 3 |
9 | 0 | 3.92 | 700 | 2 |
10 | 0 | 4 | 800 | 4 |
11 | 0 | 3.22 | 440 | 1 |
12 | 1 | 4 | 760 | 1 |
13 | 0 | 3.08 | 700 | 2 |
14 | 1 | 4 | 700 | 1 |
15 | 0 | 3.44 | 480 | 3 |
16 | 0 | 3.87 | 780 | 4 |
17 | 0 | 2.56 | 360 | 3 |
18 | 0 | 3.75 | 800 | 2 |
19 | 1 | 3.81 | 540 | 1 |
20 | 0 | 3.17 | 500 | 3 |
21 | 1 | 3.63 | 660 | 2 |
22 | 0 | 2.82 | 600 | 4 |
23 | 0 | 3.19 | 680 | 4 |
24 | 1 | 3.35 | 760 | 2 |
25 | 1 | 3.66 | 800 | 1 |
26 | 1 | 3.61 | 620 | 1 |
27 | 1 | 3.74 | 520 | 4 |
28 | 1 | 3.22 | 780 | 2 |
29 | 0 | 3.29 | 520 | 1 |
30 | 0 | 3.78 | 540 | 4 |
31 | 0 | 3.35 | 760 | 3 |
... | ... | ... | ... | ... |
399 | 0 | 3.89 | 600 | 3 |
Use the Statsample::GLM.compute method for logistic regression analysis.
The first argument to compute is the DataFrame object, followed by the name of the Vector that is to be the dependent variable, and then the method determining the link function. This can be :logistic, :probit, :poisson or :normal.
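For intuition, the :logistic option corresponds to the sigmoid (inverse-logit) function, which maps any real-valued linear predictor onto a probability in (0, 1). A minimal plain-Ruby sketch:

```ruby
# The logistic link: probability = 1 / (1 + e^(-z)),
# where z is the linear predictor (log odds).
sigmoid = ->(z) { 1.0 / (1.0 + Math.exp(-z)) }

sigmoid.call(0)    # 0.5   -- zero log odds means even chances
sigmoid.call(4)    # ~0.982
sigmoid.call(-4)   # ~0.018
```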
The coefficients method calculates the coefficients of the GLM; passing :hash returns them as a Hash.
glm = Statsample::GLM.compute df, :admit, :logistic, constant: 1
c = glm.coefficients :hash
{:gpa=>0.777013573719857, :gre=>0.0022939595044433273, :rank=>-0.5600313868499897, :constant=>-3.4495483976684773}
The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
Interpreting the coefficients above: a one-unit increase in GPA raises the log odds of admission by about 0.78, each additional GRE point raises them by about 0.002, and each step down in institution rating (rank increasing by 1) lowers them by about 0.56.
Log odds are difficult to interpret directly, so we'll exponentiate each coefficient so that it can be read as an odds ratio.
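The "change in log odds" reading can be checked directly from the fitted coefficients. A minimal plain-Ruby sketch, with the coefficient values copied from the Hash printed above — two applicants who differ by exactly one GPA point differ in log odds by exactly the :gpa coefficient, whatever the other predictors are:

```ruby
# Coefficients copied from the fitted model above.
c = { gpa: 0.777013573719857, gre: 0.0022939595044433273,
      rank: -0.5600313868499897, constant: -3.4495483976684773 }

# Linear predictor = log odds of admission.
log_odds = lambda do |gpa, gre, rank|
  c[:constant] + gpa * c[:gpa] + gre * c[:gre] + rank * c[:rank]
end

# One-unit GPA difference, everything else held fixed.
delta = log_odds.call(3.5, 600, 2) - log_odds.call(2.5, 600, 2)
# delta equals c[:gpa] (0.777...), up to floating-point rounding
```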
Daru::Vector.new(c).exp # Calling `#exp` on Daru::Vector exponentiates each element of the Vector.
Daru::Vector:17552980 size: 4

 | nil |
---|---|
gpa | 2.174967177712439 |
gre | 1.0022965926425997 |
rank | 0.571191135676971 |
constant | 0.03175997601913591 |
We can now compute the probability of gaining admission into a graduate college based on the rank of the undergraduate college, holding the GRE score and GPA constant at their means.
As the result below shows, the rankp Vector gives the probability of admission for each rank: a student from the most highly rated undergraduate school (rank 1) has a probability of about 0.49 of being admitted to graduate school.
e = Math::E
new_data = Daru::DataFrame.new({
gre: [df[:gre].mean]*4,
gpa: [df[:gpa].mean]*4,
rank: df[:rank].factors
})
new_data[:rankp] = new_data.collect(:row) do |x|
1 / (1 + e ** -(c[:constant] + x[:gre] * c[:gre] + x[:gpa] * c[:gpa] + x[:rank] * c[:rank]))
end
new_data.sort! [:rank]
Daru::DataFrame:16947240 rows: 4 cols: 4

 | gpa | gre | rank | rankp |
---|---|---|---|---|
1 | 3.3899000000000017 | 587.7 | 1 | 0.4931450619837156 |
3 | 3.3899000000000017 | 587.7 | 2 | 0.357219500353945 |
0 | 3.3899000000000017 | 587.7 | 3 | 0.240948896129993 |
2 | 3.3899000000000017 | 587.7 | 4 | 0.1534862275970381 |
To demonstrate with another example, let's create a hypothetical dataset consisting of the body weights of 20 people and whether or not they survived.
For this example we will simply assume that people with lower body weight have a lower chance of survival.
require 'distribution'
# Create a normally distributed Vector with mean 30 and standard deviation 2
rng = Distribution::Normal.rng(30,2)
body_weight = Daru::Vector.new(20.times.map { rng.call }.sort)
# Populate chances of survival, assume that people with less body weight on average
# are less likely to survive.
survive = Daru::Vector.new [0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1]
df = Daru::DataFrame.new({
body_weight: body_weight,
survive: survive
})
Daru::DataFrame:27686700 rows: 20 cols: 2

 | body_weight | survive |
---|---|---|
0 | 27.044364650712417 | 0 |
1 | 27.47159312927552 | 0 |
2 | 27.898166592007826 | 0 |
3 | 28.124290182202703 | 0 |
4 | 28.130559736750975 | 0 |
5 | 28.75865262632871 | 1 |
6 | 28.893891170932104 | 0 |
7 | 29.379537173488142 | 1 |
8 | 29.387455746614265 | 0 |
9 | 29.73011654672403 | 0 |
10 | 29.732865590281754 | 1 |
11 | 29.804600640385086 | 1 |
12 | 30.854286396908076 | 0 |
13 | 31.10670541344917 | 1 |
14 | 31.466802603305748 | 1 |
15 | 31.520641425410044 | 1 |
16 | 31.933197567214524 | 0 |
17 | 32.11397962791281 | 1 |
18 | 32.760606649719776 | 1 |
19 | 34.33739385108647 | 1 |
Compute the logistic regression coefficients.
glm = Statsample::GLM.compute df, :survive, :logistic, constant: 1
coeffs = glm.coefficients :hash
{:body_weight=>0.8433486251123171, :constant=>-25.24920458377614}
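As a quick interpretation check in plain Ruby (the value is copied from the Hash above): each additional unit of body weight multiplies the odds of survival by e raised to the :body_weight coefficient.

```ruby
# Odds ratio for a one-unit increase in body weight.
odds_multiplier = Math.exp(0.8433486251123171)
# ~2.32: each extra unit of body weight roughly
# doubles the odds of survival in this toy model
```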
Based on the coefficients, we compute the predicted probability for each value in the :body_weight Vector and store the results in another Vector called :survive_pred.
e = Math::E
df[:survive_pred] = df[:body_weight].map { |x| 1 / (1 + e ** -(coeffs[:constant] + x*coeffs[:body_weight])) }
df
Daru::DataFrame:27686700 rows: 20 cols: 3

 | body_weight | survive | survive_pred |
---|---|---|---|
0 | 27.044364650712417 | 0 | 0.08007143558819431 |
1 | 27.47159312927552 | 0 | 0.11094995452363857 |
2 | 27.898166592007826 | 0 | 0.15170068399992506 |
3 | 28.124290182202703 | 0 | 0.17790253325703076 |
4 | 28.130559736750975 | 0 | 0.1786771529208482 |
5 | 28.75865262632871 | 1 | 0.26980060957631496 |
6 | 28.893891170932104 | 0 | 0.2928502245475736 |
7 | 29.379537173488142 | 1 | 0.38414006941637974 |
8 | 29.387455746614265 | 0 | 0.3857211724501716 |
9 | 29.73011654672403 | 0 | 0.456025989208083 |
10 | 29.732865590281754 | 1 | 0.4566011649897577 |
11 | 29.804600640385086 | 1 | 0.4716465476624143 |
12 | 30.854286396908076 | 0 | 0.6838918583579029 |
13 | 31.10670541344917 | 1 | 0.7280185490554567 |
14 | 31.466802603305748 | 1 | 0.7838559408058121 |
15 | 31.520641425410044 | 1 | 0.7914495278564925 |
16 | 31.933197567214524 | 0 | 0.843118090723654 |
17 | 32.11397962791281 | 1 | 0.8622465766953867 |
18 | 32.760606649719776 | 1 | 0.9152435218371247 |
19 | 34.33739385108647 | 1 | 0.9760883965278441 |
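A small sanity check, using the coefficients printed earlier: the predicted probability crosses 0.5 exactly where the linear predictor is zero, i.e. at body_weight = -constant / slope.

```ruby
# Body weight at which predicted survival probability is exactly 0.5.
threshold = 25.24920458377614 / 0.8433486251123171
# ~29.94 -- consistent with the table above, where rows below this
# weight have survive_pred < 0.5 and rows above it have > 0.5
```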
The above results can then be plotted using the plot method.
The curve looks like an ideal logistic regression curve.
df.plot type: [:scatter,:line], x: [:body_weight]*2, y: [:survive_pred]*2 do |plot, diagram|
plot.x_label "Body Weight"
plot.y_label "Probability of Survival"
end