In this notebook we'll go over ways to group, split and pivot data using the DataFrame#group_by and and DataFrame#pivot_table functions.
require 'daru'
true
Lets first create a hierarchically indexed DataFrame.
# Create a multi-indexed DataFrame
tuples = [
[:a,:one,:bar],
[:a,:one,:baz],
[:a,:two,:bar],
[:a,:two,:baz],
[:b,:one,:bar],
[:b,:two,:bar],
[:b,:two,:baz],
[:b,:one,:foo],
[:c,:one,:bar],
[:c,:one,:baz],
[:c,:two,:foo],
[:c,:two,:bar]
]
multi_index = Daru::MultiIndex.from_tuples(tuples)
vector_arry1 = [11,12,13,14,11,12,13,14,11,12,13,14]
vector_arry2 = [1,2,3,4,1,2,3,4,1,2,3,4]
order_mi = Daru::MultiIndex.from_tuples([
[:a,:one,:bar],
[:a,:two,:baz],
[:b,:two,:foo],
[:b,:one,:foo]])
df_mi = Daru::DataFrame.new([
vector_arry1,
vector_arry2,
vector_arry1,
vector_arry2], order: order_mi, index: multi_index)
Daru::DataFrame:24162060 rows: 12 cols: 4 | ||||
---|---|---|---|---|
[:a, :one, :bar] | [:a, :two, :baz] | [:b, :two, :foo] | [:b, :one, :foo] | |
[:a, :one, :bar] | 11 | 1 | 11 | 1 |
[:a, :one, :baz] | 12 | 2 | 12 | 2 |
[:a, :two, :bar] | 13 | 3 | 13 | 3 |
[:a, :two, :baz] | 14 | 4 | 14 | 4 |
[:b, :one, :bar] | 11 | 1 | 11 | 1 |
[:b, :two, :bar] | 12 | 2 | 12 | 2 |
[:b, :two, :baz] | 13 | 3 | 13 | 3 |
[:b, :one, :foo] | 14 | 4 | 14 | 4 |
[:c, :one, :bar] | 11 | 1 | 11 | 1 |
[:c, :one, :baz] | 12 | 2 | 12 | 2 |
[:c, :two, :foo] | 13 | 3 | 13 | 3 |
[:c, :two, :bar] | 14 | 4 | 14 | 4 |
To select a row from a multi-indexed dataframe, you should pass the full tuple to the #row[]
method. Partial tuples will return whatever row that match the tuple partially.
# Specify complete tuple to choose a single row
df_mi.row[:a, :one,:bar]
Daru::Vector:22866540 size: 4 | |
---|---|
0 | |
[:a, :one, :bar] | 11 |
[:a, :two, :baz] | 1 |
[:b, :two, :foo] | 11 |
[:b, :one, :foo] | 1 |
# Specify partial tuple to select index hierarchially
df_mi.row[:a]
Daru::DataFrame:22500640 rows: 4 cols: 4 | ||||
---|---|---|---|---|
[:a, :one, :bar] | [:a, :two, :baz] | [:b, :two, :foo] | [:b, :one, :foo] | |
[:one, :bar] | 11 | 1 | 11 | 1 |
[:one, :baz] | 12 | 2 | 12 | 2 |
[:two, :bar] | 13 | 3 | 13 | 3 |
[:two, :baz] | 14 | 4 | 14 | 4 |
The DataFrame#group_by method allows you to group elements in the dataframe by name. This is similar to SQL GROUP BY.
# See grouped rows with the 'groups' method
df = Daru::DataFrame.new({
a: %w{foo bar foo bar foo bar foo foo},
b: %w{one one two three two two one three},
c: [1 ,2 ,3 ,1 ,3 ,6 ,3 ,8],
d: [11 ,22 ,33 ,44 ,55 ,66 ,77 ,88]
})
# Pass the vectors that are to be grouped in an Array to the group_by method. This
# will return a Daru::Core::GroupBy object.
grouped = df.group_by([:a, :b])
# See the groups created using the 'groups' method.
grouped.groups
{["bar", "one"]=>[1], ["bar", "three"]=>[3], ["bar", "two"]=>[5], ["foo", "one"]=>[0, 6], ["foo", "three"]=>[7], ["foo", "two"]=>[2, 4]}
# First group by the columns :a and :b and then calculate mean of the grouped rows.
grouped.mean
Daru::DataFrame:21484420 rows: 6 cols: 2 | ||
---|---|---|
c | d | |
["bar", "one"] | 2 | 22 |
["bar", "three"] | 1 | 44 |
["bar", "two"] | 6 | 66 |
["foo", "one"] | 2.0 | 44.0 |
["foo", "three"] | 8 | 88 |
["foo", "two"] | 3.0 | 44.0 |
The #get_group method can be used for accessing a particualar group(s).
grouped.get_group(["foo", "one"])
Daru::DataFrame:21046320 rows: 2 cols: 4 | ||||
---|---|---|---|---|
a | b | c | d | |
0 | foo | one | 1 | 11 |
6 | foo | one | 3 | 77 |
Similar to Excel's Pivot Table, DataFrame provides the #pivot_table functions for quickly pivoting data around a particular value(s) and getting quick insights.
Lets demonstrate using some sales data.
sales = Daru::DataFrame.from_csv 'data/sales-funnel.csv'
Daru::DataFrame:18812540 rows: 17 cols: 8 | ||||||||
---|---|---|---|---|---|---|---|---|
Account | Manager | Name | Price | Product | Quantity | Rep | Status | |
0 | 714466 | Debra Henley | Trantow-Barrows | 30000 | CPU | 1 | Craig Booker | presented |
1 | 714466 | Debra Henley | Trantow-Barrows | 10000 | Software | 1 | Craig Booker | presented |
2 | 714466 | Debra Henley | Trantow-Barrows | 5000 | Maintenance | 2 | Craig Booker | pending |
3 | 737550 | Debra Henley | Fritsch, Russel and Anderson | 35000 | CPU | 1 | Craig Booker | declined |
4 | 146832 | Debra Henley | Kiehn-Spinka | 65000 | CPU | 2 | Daniel Hilton | won |
5 | 218895 | Debra Henley | Kulas Inc | 40000 | CPU | 2 | Daniel Hilton | pending |
6 | 218895 | Debra Henley | Kulas Inc | 10000 | Software | 1 | Daniel Hilton | presented |
7 | 412290 | Debra Henley | Jerde-Hilpert | 5000 | Maintenance | 2 | John Smith | pending |
8 | 740150 | Debra Henley | Barton LLC | 35000 | CPU | 1 | John Smith | declined |
9 | 141962 | Fred Anderson | Herman LLC | 65000 | CPU | 2 | Cedric Moss | won |
10 | 163416 | Fred Anderson | Purdy-Kunde | 30000 | CPU | 1 | Cedric Moss | presented |
11 | 239344 | Fred Anderson | Stokes LLC | 5000 | Maintenance | 1 | Cedric Moss | pending |
12 | 239344 | Fred Anderson | Stokes LLC | 10000 | Software | 1 | Cedric Moss | presented |
13 | 307599 | Fred Anderson | Kassulke, Ondricka and Metz | 7000 | Maintenance | 3 | Wendy Yule | won |
14 | 688981 | Fred Anderson | Keeling LLC | 100000 | CPU | 5 | Wendy Yule | won |
15 | 729833 | Fred Anderson | Koepp Ltd | 65000 | CPU | 2 | Wendy Yule | declined |
16 | 729833 | Fred Anderson | Koepp Ltd | 5000 | Monitor | 2 | Wendy Yule | presented |
The #pivot_table method accepts an option :index, in which you can specify what vectors you want to index your DataFrame against.
sales.pivot_table index: ['Manager', 'Rep']
Daru::DataFrame:17258300 rows: 5 cols: 3 | |||
---|---|---|---|
Account | Price | Quantity | |
["Debra Henley", "Craig Booker"] | 720237.0 | 20000.0 | 1.25 |
["Debra Henley", "Daniel Hilton"] | 194874.0 | 38333.333333333336 | 1.6666666666666667 |
["Debra Henley", "John Smith"] | 576220.0 | 20000.0 | 1.5 |
["Fred Anderson", "Cedric Moss"] | 196016.5 | 27500.0 | 1.25 |
["Fred Anderson", "Wendy Yule"] | 614061.5 | 44250.0 | 3.0 |
You can also specify the :values
option for specifying which Vector is to be used for the values.
The :vectors
options lets you specify the columns to pivot against.
The :agg
option specifies the aggregation function. This can be any stats method like :mean, :median, :product, etc.
sales.pivot_table(index: ['Manager','Rep'], values: 'Price', vectors: ['Product'], agg: :sum)
Daru::DataFrame:15752920 rows: 5 cols: 12 | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
["Account", "CPU"] | ["Account", "Software"] | ["Account", "Maintenance"] | ["Price", "CPU"] | ["Price", "Software"] | ["Price", "Maintenance"] | ["Quantity", "CPU"] | ["Quantity", "Software"] | ["Quantity", "Maintenance"] | ["Account", "Monitor"] | ["Price", "Monitor"] | ["Quantity", "Monitor"] | |
["Debra Henley", "Craig Booker"] | 1452016 | 714466 | 714466 | 65000 | 10000 | 5000 | 2 | 1 | 2 | |||
["Debra Henley", "Daniel Hilton"] | 365727 | 218895 | 105000 | 10000 | 4 | 1 | ||||||
["Debra Henley", "John Smith"] | 740150 | 412290 | 35000 | 5000 | 1 | 2 | ||||||
["Fred Anderson", "Cedric Moss"] | 305378 | 239344 | 239344 | 95000 | 10000 | 5000 | 3 | 1 | 1 | |||
["Fred Anderson", "Wendy Yule"] | 1418814 | 307599 | 165000 | 7000 | 7 | 3 | 729833 | 5000 | 2 |