Grouping, Splitting and Pivoting Data¶

In this notebook we'll go over ways to group, split and pivot data using the DataFrame#group_by and and DataFrame#pivot_table functions.

In [1]:

require 'daru'

Out[1]:

true

Lets first create a hierarchically indexed DataFrame.

In [2]:

# Create a multi-indexed DataFrame

tuples = [
  [:a,:one,:bar],
  [:a,:one,:baz],
  [:a,:two,:bar],
  [:a,:two,:baz],
  [:b,:one,:bar],
  [:b,:two,:bar],
  [:b,:two,:baz],
  [:b,:one,:foo],
  [:c,:one,:bar],
  [:c,:one,:baz],
  [:c,:two,:foo],
  [:c,:two,:bar]
]
multi_index = Daru::MultiIndex.from_tuples(tuples)

vector_arry1 = [11,12,13,14,11,12,13,14,11,12,13,14]
vector_arry2 = [1,2,3,4,1,2,3,4,1,2,3,4]

order_mi = Daru::MultiIndex.from_tuples([
  [:a,:one,:bar],
  [:a,:two,:baz],
  [:b,:two,:foo],
  [:b,:one,:foo]])

df_mi = Daru::DataFrame.new([
  vector_arry1, 
  vector_arry2, 
  vector_arry1, 
  vector_arry2], order: order_mi, index: multi_index)

Out[2]:

Daru::DataFrame:24162060 rows: 12 cols: 4
	[:a, :one, :bar]	[:a, :two, :baz]	[:b, :two, :foo]	[:b, :one, :foo]
[:a, :one, :bar]	11	1	11	1
[:a, :one, :baz]	12	2	12	2
[:a, :two, :bar]	13	3	13	3
[:a, :two, :baz]	14	4	14	4
[:b, :one, :bar]	11	1	11	1
[:b, :two, :bar]	12	2	12	2
[:b, :two, :baz]	13	3	13	3
[:b, :one, :foo]	14	4	14	4
[:c, :one, :bar]	11	1	11	1
[:c, :one, :baz]	12	2	12	2
[:c, :two, :foo]	13	3	13	3
[:c, :two, :bar]	14	4	14	4

To select a row from a multi-indexed dataframe, you should pass the full tuple to the #row[] method. Partial tuples will return whatever row that match the tuple partially.

In [3]:

# Specify complete tuple to choose a single row
df_mi.row[:a, :one,:bar]

Out[3]:

Daru::Vector:22866540 size: 4
	0
[:a, :one, :bar]	11
[:a, :two, :baz]	1
[:b, :two, :foo]	11
[:b, :one, :foo]	1

In [4]:

# Specify partial tuple to select index hierarchially
df_mi.row[:a]

Out[4]:

Daru::DataFrame:22500640 rows: 4 cols: 4
	[:a, :one, :bar]	[:a, :two, :baz]	[:b, :two, :foo]	[:b, :one, :foo]
[:one, :bar]	11	1	11	1
[:one, :baz]	12	2	12	2
[:two, :bar]	13	3	13	3
[:two, :baz]	14	4	14	4

Grouping¶

The DataFrame#group_by method allows you to group elements in the dataframe by name. This is similar to SQL GROUP BY.

In [5]:

# See grouped rows with the 'groups' method

df = Daru::DataFrame.new({
  a: %w{foo bar foo bar   foo bar foo foo},
  b: %w{one one two three two two one three},
  c:   [1  ,2  ,3  ,1    ,3  ,6  ,3  ,8],
  d:   [11 ,22 ,33 ,44   ,55 ,66 ,77 ,88]
})

# Pass the vectors that are to be grouped in an Array to the group_by method. This 
# will return a Daru::Core::GroupBy object.
grouped = df.group_by([:a, :b])

# See the groups created using the 'groups' method.
grouped.groups

Out[5]:

{["bar", "one"]=>[1], ["bar", "three"]=>[3], ["bar", "two"]=>[5], ["foo", "one"]=>[0, 6], ["foo", "three"]=>[7], ["foo", "two"]=>[2, 4]}

In [6]:

# First group by the columns :a and :b and then calculate mean of the grouped rows.
grouped.mean

Out[6]:

Daru::DataFrame:21484420 rows: 6 cols: 2
	c	d
["bar", "one"]	2	22
["bar", "three"]	1	44
["bar", "two"]	6	66
["foo", "one"]	2.0	44.0
["foo", "three"]	8	88
["foo", "two"]	3.0	44.0

The #get_group method can be used for accessing a particualar group(s).

In [7]:

grouped.get_group(["foo", "one"])

Out[7]:

Daru::DataFrame:21046320 rows: 2 cols: 4
	a	b	c	d
0	foo	one	1	11
6	foo	one	3	77

Pivoting Data¶

Similar to Excel's Pivot Table, DataFrame provides the #pivot_table functions for quickly pivoting data around a particular value(s) and getting quick insights.

Lets demonstrate using some sales data.

In [8]:

sales = Daru::DataFrame.from_csv 'data/sales-funnel.csv'

Out[8]:

Daru::DataFrame:18812540 rows: 17 cols: 8
	Account	Manager	Name	Price	Product	Quantity	Rep	Status
0	714466	Debra Henley	Trantow-Barrows	30000	CPU	1	Craig Booker	presented
1	714466	Debra Henley	Trantow-Barrows	10000	Software	1	Craig Booker	presented
2	714466	Debra Henley	Trantow-Barrows	5000	Maintenance	2	Craig Booker	pending
3	737550	Debra Henley	Fritsch, Russel and Anderson	35000	CPU	1	Craig Booker	declined
4	146832	Debra Henley	Kiehn-Spinka	65000	CPU	2	Daniel Hilton	won
5	218895	Debra Henley	Kulas Inc	40000	CPU	2	Daniel Hilton	pending
6	218895	Debra Henley	Kulas Inc	10000	Software	1	Daniel Hilton	presented
7	412290	Debra Henley	Jerde-Hilpert	5000	Maintenance	2	John Smith	pending
8	740150	Debra Henley	Barton LLC	35000	CPU	1	John Smith	declined
9	141962	Fred Anderson	Herman LLC	65000	CPU	2	Cedric Moss	won
10	163416	Fred Anderson	Purdy-Kunde	30000	CPU	1	Cedric Moss	presented
11	239344	Fred Anderson	Stokes LLC	5000	Maintenance	1	Cedric Moss	pending
12	239344	Fred Anderson	Stokes LLC	10000	Software	1	Cedric Moss	presented
13	307599	Fred Anderson	Kassulke, Ondricka and Metz	7000	Maintenance	3	Wendy Yule	won
14	688981	Fred Anderson	Keeling LLC	100000	CPU	5	Wendy Yule	won
15	729833	Fred Anderson	Koepp Ltd	65000	CPU	2	Wendy Yule	declined
16	729833	Fred Anderson	Koepp Ltd	5000	Monitor	2	Wendy Yule	presented

The #pivot_table method accepts an option :index, in which you can specify what vectors you want to index your DataFrame against.

In [9]:

sales.pivot_table index: ['Manager', 'Rep']

Out[9]:

Daru::DataFrame:17258300 rows: 5 cols: 3
	Account	Price	Quantity
["Debra Henley", "Craig Booker"]	720237.0	20000.0	1.25
["Debra Henley", "Daniel Hilton"]	194874.0	38333.333333333336	1.6666666666666667
["Debra Henley", "John Smith"]	576220.0	20000.0	1.5
["Fred Anderson", "Cedric Moss"]	196016.5	27500.0	1.25
["Fred Anderson", "Wendy Yule"]	614061.5	44250.0	3.0

You can also specify the :values option for specifying which Vector is to be used for the values.

The :vectors options lets you specify the columns to pivot against.

The :agg option specifies the aggregation function. This can be any stats method like :mean, :median, :product, etc.

In [10]:

sales.pivot_table(index: ['Manager','Rep'], values: 'Price', vectors: ['Product'], agg: :sum)

Out[10]:

Daru::DataFrame:15752920 rows: 5 cols: 12
	["Account", "CPU"]	["Account", "Software"]	["Account", "Maintenance"]	["Price", "CPU"]	["Price", "Software"]	["Price", "Maintenance"]	["Quantity", "CPU"]	["Quantity", "Software"]	["Quantity", "Maintenance"]	["Account", "Monitor"]	["Price", "Monitor"]	["Quantity", "Monitor"]
["Debra Henley", "Craig Booker"]	1452016	714466	714466	65000	10000	5000	2	1	2
["Debra Henley", "Daniel Hilton"]	365727	218895		105000	10000		4	1
["Debra Henley", "John Smith"]	740150		412290	35000		5000	1		2
["Fred Anderson", "Cedric Moss"]	305378	239344	239344	95000	10000	5000	3	1	1
["Fred Anderson", "Wendy Yule"]	1418814		307599	165000		7000	7		3	729833	5000	2