require 'daru'
require 'distribution'
require 'gnuplotrb'
true
Vectors are indexed by passing labels through the :index option, and named with the :name option.
vector = Daru::Vector.new(
[20,40,25,50,45,12], index: ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar'],
name: "Prices of stuff.")
Daru::Vector:15070620 size: 6 | |
---|---|
Prices of stuff. | |
cherry | 20 |
apple | 40 |
barley | 25 |
wheat | 50 |
rice | 45 |
sugar | 12 |
Specify the index you want to retrieve in the #[] operator.
vector['rice']
45
Multiple values can be retrieved at the same time as another Daru::Vector by separating them with commas.
vector['rice', 'wheat', 'sugar']
Daru::Vector:14387920 size: 3 | |
---|---|
Prices of stuff. | |
rice | 45 |
wheat | 50 |
sugar | 12 |
Specifying a range of indexes will retrieve a slice of the Daru::Vector
vector['barley'..'sugar']
Daru::Vector:14063700 size: 4 | |
---|---|
Prices of stuff. | |
barley | 25 |
wheat | 50 |
rice | 45 |
sugar | 12 |
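The three lookup styles above (single label, several labels, label range) can be pictured with plain arrays. This is only a sketch of the idea, not daru's implementation; `fetch_by_label` is a hypothetical helper introduced for illustration.

```ruby
# Hypothetical helper (not part of daru): label-based lookup over two
# parallel arrays, one holding the labels and one holding the values.
def fetch_by_label(labels, values, key)
  # Map each label to its position, as an index conceptually does.
  position = labels.each_with_index.to_h
  case key
  when Range
    # Label range: slice between the positions of both end labels, inclusive.
    values[position[key.begin]..position[key.end]]
  when Array
    # Several labels: one lookup per label, in the order requested.
    key.map { |label| values[position[label]] }
  else
    # Single label: return the bare value.
    values[position[key]]
  end
end

labels = ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar']
values = [20, 40, 25, 50, 45, 12]

p fetch_by_label(labels, values, 'rice')                     # => 45
p fetch_by_label(labels, values, ['rice', 'wheat', 'sugar']) # => [45, 50, 12]
p fetch_by_label(labels, values, 'barley'..'sugar')          # => [25, 50, 45, 12]
```

Note that the range is resolved by position in the index, not by alphabetical order of the labels.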
Assign a value by specifying the index directly to the #[]= operator
vector['barley'] = 1500
vector
Daru::Vector:15070620 size: 6 | |
---|---|
Prices of stuff. | |
cherry | 20 |
apple | 40 |
barley | 1500 |
wheat | 50 |
rice | 45 |
sugar | 12 |
The :index option specifies the row index of the DataFrame, and the :order option determines the order in which the columns are stored.
Note that this is only one way of creating a DataFrame. There are around 8 different ways you can do so, depending on your use case.
df = Daru::DataFrame.new({
'col0' => [1,2,3,4,5,6],
'col2' => ['a','b','c','d','e','f'],
'col1' => [11,22,33,44,55,66]
},
index: ['one', 'two', 'three', 'four', 'five', 'six'],
order: ['col0', 'col1', 'col2']
)
Daru::DataFrame:13337740 rows: 6 cols: 3 | |||
---|---|---|---|
col0 | col1 | col2 | |
one | 1 | 11 | a |
two | 2 | 22 | b |
three | 3 | 33 | c |
four | 4 | 44 | d |
five | 5 | 55 | e |
six | 6 | 66 | f |
A DataFrame column can be accessed using the DataFrame#[] operator.
Note that it returns a Daru::Vector
df['col1']
Daru::Vector:13292960 size: 6 | |
---|---|
col1 | |
one | 11 |
two | 22 |
three | 33 |
four | 44 |
five | 55 |
six | 66 |
Multiple columns can be accessed by separating them with a comma. The result is another DataFrame.
df['col2', 'col0']
Daru::DataFrame:12423020 rows: 6 cols: 2 | ||
---|---|---|
col2 | col0 | |
one | a | 1 |
two | b | 2 |
three | c | 3 |
four | d | 4 |
five | e | 5 |
six | f | 6 |
A slice of the DataFrame by columns can be obtained by specifying a Range in #[]
df['col1'..'col2']
Daru::DataFrame:12007160 rows: 6 cols: 2 | ||
---|---|---|
col1 | col2 | |
one | 11 | a |
two | 22 | b |
three | 33 | c |
four | 44 | d |
five | 55 | e |
six | 66 | f |
You can assign a Daru::Vector to a column and the indexes of the Vector will be automatically matched to that of the DataFrame.
df['col1'] = Daru::Vector.new(['this', 'is', 'some','new','data','here'],
index: ['one', 'three','two','six','four', 'five'])
df
Daru::DataFrame:13337740 rows: 6 cols: 3 | |||
---|---|---|---|
col0 | col1 | col2 | |
one | 1 | this | a |
two | 2 | some | b |
three | 3 | is | c |
four | 4 | data | d |
five | 5 | here | e |
six | 6 | new | f |
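The index matching shown above can be sketched in plain Ruby: each incoming value is placed on the row that carries the same label, regardless of the order in which the values were supplied. `align_to_index` is a hypothetical helper for illustration, not a daru method.

```ruby
# Hypothetical helper (not part of daru): reorder labelled values so that
# they follow the row order of the target index.
def align_to_index(target_index, labels, values)
  by_label = labels.zip(values).to_h         # label => value
  target_index.map { |label| by_label[label] }
end

df_index   = ['one', 'two', 'three', 'four', 'five', 'six']
new_labels = ['one', 'three', 'two', 'six', 'four', 'five']
new_values = ['this', 'is', 'some', 'new', 'data', 'here']

p align_to_index(df_index, new_labels, new_values)
# => ["this", "some", "is", "data", "here", "new"]
```

The result matches the col1 column printed above, row by row.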
A single row can be accessed using the #row[] function.
df.row['four']
Daru::Vector:11115780 size: 3 | |
---|---|
four | |
col0 | 4 |
col1 | data |
col2 | d |
Specifying a Range of row indexes in #row[] will select a DataFrame with those rows.
df.row['three'..'five']
Daru::DataFrame:9135240 rows: 3 cols: 3 | |||
---|---|---|---|
col0 | col1 | col2 | |
three | 3 | is | c |
four | 4 | data | d |
five | 5 | here | e |
You can also assign a row, using either a Daru::Vector or, as below, a plain Array. Notice that the values are matched according to the order of the DataFrame.
df.row['five'] = [666,555,333]
[666, 555, 333]
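A host of static and rolling statistics methods are provided on Daru::Vector.
Note that missing data (very common in most real-world scenarios) is handled gracefully: nil values are left out of the computation, so the mean below is taken over the five values that are present.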
vector = Daru::Vector.new([1,3,5,nil,2,53,nil])
vector.mean
12.8
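The result above is what you get by dropping the nil entries before averaging. The following plain-Ruby sketch reproduces the number; `mean_skipping_nil` is a hypothetical helper, and daru's own implementation may differ.

```ruby
# Hypothetical helper (not part of daru): arithmetic mean over the
# non-nil entries only.
def mean_skipping_nil(data)
  present = data.compact                 # drop missing values
  present.sum.to_f / present.size
end

p mean_skipping_nil([1, 3, 5, nil, 2, 53, nil])  # => 12.8
```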
DataFrame statistics apply the corresponding method to every numerical column of the DataFrame. Here only col0 is numeric, so the result has a single entry.
df.mean
Daru::Vector:8060380 size: 1 | |
---|---|
mean | |
col0 | 113.66666666666667 |
Useful statistics about the vectors in a DataFrame can be observed with #describe
df.describe
Daru::DataFrame:7470980 rows: 5 cols: 1 | |
---|---|
col0 | |
count | 6 |
mean | 113.66666666666667 |
std | 270.5924364550249 |
min | 1 |
max | 666 |
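The figures in this summary can be checked by hand. The sketch below recomputes them for col0 in plain Ruby, using the values the column holds after row 'five' was reassigned. The std shown above is consistent with the sample standard deviation (dividing by n - 1), which is what this sketch uses; `describe_column` is a hypothetical helper, not daru's code.

```ruby
# Hypothetical helper (not part of daru): summary statistics for one
# numeric column, using the sample standard deviation (n - 1).
def describe_column(values)
  count = values.size
  mean  = values.sum.to_f / count
  variance = values.sum { |x| (x - mean)**2 } / (count - 1)
  { count: count, mean: mean, std: Math.sqrt(variance),
    min: values.min, max: values.max }
end

summary = describe_column([1, 2, 3, 4, 666, 6])
p summary[:count]          # => 6
p summary[:std].round(4)   # => 270.5924
p summary[:max]            # => 666
```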
Daru offers a robust time series manipulation API for indexing data based on timestamps. This makes daru a viable tool for analyzing financial data (or any data that changes with time).
The DateTimeIndex is a special index for indexing data based on timestamps.
A date index range can be created using the DateTimeIndex.date_range function. The :freq option decides the time frequency between each timestamp in the date index.
index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 1000, :freq => '3D')
#<DateTimeIndex:6151760 offset=3D periods=1000 data=[2012-01-01T00:00:00+00:00...2020-03-16T00:00:00+00:00]>
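To see where the end date in that output comes from, here is the same sequence built with Ruby's standard Date class: 1000 timestamps, three days apart, starting on 1 January 2012. `every_n_days` is a hypothetical helper for illustration; it only mimics the '3D' case and says nothing about how daru parses other frequency strings.

```ruby
require 'date'

# Hypothetical helper (not part of daru): `periods` dates spaced
# `step_days` apart, starting at `start`.
def every_n_days(start, periods, step_days)
  Array.new(periods) { |i| start + step_days * i }
end

dates = every_n_days(Date.new(2012, 1, 1), 1000, 3)
puts dates.first  # 2012-01-01
puts dates.last   # 2020-03-16
```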
A Daru::Vector can be created by simply passing the newly created index object into the :index argument.
timeseries = Daru::Vector.new(1000.times.map {rand}, index: index)
Daru::Vector:5628020 size: 1000 | |
---|---|
nil | |
2012-01-01T00:00:00+00:00 | 0.692831672574459 |
2012-01-04T00:00:00+00:00 | 0.6971783281963972 |
2012-01-07T00:00:00+00:00 | 0.34687766698487965 |
2012-01-10T00:00:00+00:00 | 0.5509404993547384 |
2012-01-13T00:00:00+00:00 | 0.10166975999865946 |
2012-01-16T00:00:00+00:00 | 0.34183413903843207 |
2012-01-19T00:00:00+00:00 | 0.018428168123970967 |
2012-01-22T00:00:00+00:00 | 0.7792652522504137 |
2012-01-25T00:00:00+00:00 | 0.24793667731961144 |
2012-01-28T00:00:00+00:00 | 0.7200752551979407 |
2012-01-31T00:00:00+00:00 | 0.770756064084555 |
2012-02-03T00:00:00+00:00 | 0.6475396341969668 |
2012-02-06T00:00:00+00:00 | 0.00034544180080875453 |
2012-02-09T00:00:00+00:00 | 0.9881939271758362 |
2012-02-12T00:00:00+00:00 | 0.042428559674003274 |
2012-02-15T00:00:00+00:00 | 0.6604582692043693 |
2012-02-18T00:00:00+00:00 | 0.6446959879056338 |
2012-02-21T00:00:00+00:00 | 0.11606340772777746 |
2012-02-24T00:00:00+00:00 | 0.5238981665473298 |
2012-02-27T00:00:00+00:00 | 0.25979569124671453 |
2012-03-01T00:00:00+00:00 | 0.1808967702663009 |
2012-03-04T00:00:00+00:00 | 0.04614156947957693 |
2012-03-07T00:00:00+00:00 | 0.8935716437439504 |
2012-03-10T00:00:00+00:00 | 0.7197074871013468 |
2012-03-13T00:00:00+00:00 | 0.20741375904156445 |
2012-03-16T00:00:00+00:00 | 0.501647901862296 |
2012-03-19T00:00:00+00:00 | 0.9470421480253584 |
2012-03-22T00:00:00+00:00 | 0.2954430257659184 |
2012-03-25T00:00:00+00:00 | 0.18422816661946229 |
2012-03-28T00:00:00+00:00 | 0.48737285121462925 |
2012-03-31T00:00:00+00:00 | 0.7549290269495055 |
2012-04-03T00:00:00+00:00 | 0.8216050188191338 |
... | ... |
2020-03-16T00:00:00+00:00 | 0.8324422863437039 |
When a Vector or DataFrame is indexed by a DateTimeIndex, it allows you to partially specify the date to retrieve all the data that belongs to that date.
For example, to access all the data belonging to the year 2012:
timeseries['2012']
Daru::Vector:15406520 size: 122 | |
---|---|
nil | |
2012-01-01T00:00:00+00:00 | 0.692831672574459 |
2012-01-04T00:00:00+00:00 | 0.6971783281963972 |
2012-01-07T00:00:00+00:00 | 0.34687766698487965 |
2012-01-10T00:00:00+00:00 | 0.5509404993547384 |
2012-01-13T00:00:00+00:00 | 0.10166975999865946 |
2012-01-16T00:00:00+00:00 | 0.34183413903843207 |
2012-01-19T00:00:00+00:00 | 0.018428168123970967 |
2012-01-22T00:00:00+00:00 | 0.7792652522504137 |
2012-01-25T00:00:00+00:00 | 0.24793667731961144 |
2012-01-28T00:00:00+00:00 | 0.7200752551979407 |
2012-01-31T00:00:00+00:00 | 0.770756064084555 |
2012-02-03T00:00:00+00:00 | 0.6475396341969668 |
2012-02-06T00:00:00+00:00 | 0.00034544180080875453 |
2012-02-09T00:00:00+00:00 | 0.9881939271758362 |
2012-02-12T00:00:00+00:00 | 0.042428559674003274 |
2012-02-15T00:00:00+00:00 | 0.6604582692043693 |
2012-02-18T00:00:00+00:00 | 0.6446959879056338 |
2012-02-21T00:00:00+00:00 | 0.11606340772777746 |
2012-02-24T00:00:00+00:00 | 0.5238981665473298 |
2012-02-27T00:00:00+00:00 | 0.25979569124671453 |
2012-03-01T00:00:00+00:00 | 0.1808967702663009 |
2012-03-04T00:00:00+00:00 | 0.04614156947957693 |
2012-03-07T00:00:00+00:00 | 0.8935716437439504 |
2012-03-10T00:00:00+00:00 | 0.7197074871013468 |
2012-03-13T00:00:00+00:00 | 0.20741375904156445 |
2012-03-16T00:00:00+00:00 | 0.501647901862296 |
2012-03-19T00:00:00+00:00 | 0.9470421480253584 |
2012-03-22T00:00:00+00:00 | 0.2954430257659184 |
2012-03-25T00:00:00+00:00 | 0.18422816661946229 |
2012-03-28T00:00:00+00:00 | 0.48737285121462925 |
2012-03-31T00:00:00+00:00 | 0.7549290269495055 |
2012-04-03T00:00:00+00:00 | 0.8216050188191338 |
... | ... |
2012-12-29T00:00:00+00:00 | 0.26155523165437944 |
Or to access data whose time stamp is March 2012...
timeseries['2012-3']
Daru::Vector:14832480 size: 11 | |
---|---|
nil | |
2012-03-01T00:00:00+00:00 | 0.1808967702663009 |
2012-03-04T00:00:00+00:00 | 0.04614156947957693 |
2012-03-07T00:00:00+00:00 | 0.8935716437439504 |
2012-03-10T00:00:00+00:00 | 0.7197074871013468 |
2012-03-13T00:00:00+00:00 | 0.20741375904156445 |
2012-03-16T00:00:00+00:00 | 0.501647901862296 |
2012-03-19T00:00:00+00:00 | 0.9470421480253584 |
2012-03-22T00:00:00+00:00 | 0.2954430257659184 |
2012-03-25T00:00:00+00:00 | 0.18422816661946229 |
2012-03-28T00:00:00+00:00 | 0.48737285121462925 |
2012-03-31T00:00:00+00:00 | 0.7549290269495055 |
Specifying the date precisely will return the exact data point (you can also pass a Ruby DateTime object to obtain data for a precise timestamp).
timeseries['2012-3-10']
0.7197074871013468
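Partial date matching can be thought of as filtering the index on whichever date fields were given. The plain-Ruby sketch below rebuilds the same three-day sequence and reproduces the sizes seen above (122 entries for 2012, 11 for March 2012). `select_partial` is a hypothetical helper, and daru may well implement this lookup differently.

```ruby
require 'date'

# Hypothetical helper (not part of daru): keep the dates that match the
# given year and, if supplied, the given month.
def select_partial(dates, year, month = nil)
  dates.select { |d| d.year == year && (month.nil? || d.month == month) }
end

start = Date.new(2012, 1, 1)
dates = Array.new(1000) { |i| start + 3 * i }   # same spacing as '3D'

p select_partial(dates, 2012).size      # => 122
p select_partial(dates, 2012, 3).size   # => 11
p dates.index(Date.new(2012, 3, 10))    # => 23 (a fully specified date hits one position)
```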
Say you have per-second data about the price of a commodity and want to access the prices for the minute starting at 12:42 pm on 23 March 2012:
index = Daru::DateTimeIndex.date_range(
:start => '2012-3-23 11:00', :periods => 20000, :freq => 'S')
seconds_ts = Daru::Vector.new(20000.times.map { rand(50) }, index: index)
seconds_ts['2012-3-23 12:42']
Daru::Vector:28416340 size: 60 | |
---|---|
nil | |
2012-03-23T12:42:00+00:00 | 4 |
2012-03-23T12:42:01+00:00 | 32 |
2012-03-23T12:42:02+00:00 | 35 |
2012-03-23T12:42:03+00:00 | 35 |
2012-03-23T12:42:04+00:00 | 14 |
2012-03-23T12:42:05+00:00 | 1 |
2012-03-23T12:42:06+00:00 | 43 |
2012-03-23T12:42:07+00:00 | 39 |
2012-03-23T12:42:08+00:00 | 20 |
2012-03-23T12:42:09+00:00 | 16 |
2012-03-23T12:42:10+00:00 | 43 |
2012-03-23T12:42:11+00:00 | 0 |
2012-03-23T12:42:12+00:00 | 27 |
2012-03-23T12:42:13+00:00 | 43 |
2012-03-23T12:42:14+00:00 | 43 |
2012-03-23T12:42:15+00:00 | 18 |
2012-03-23T12:42:16+00:00 | 35 |
2012-03-23T12:42:17+00:00 | 39 |
2012-03-23T12:42:18+00:00 | 35 |
2012-03-23T12:42:19+00:00 | 23 |
2012-03-23T12:42:20+00:00 | 25 |
2012-03-23T12:42:21+00:00 | 13 |
2012-03-23T12:42:22+00:00 | 5 |
2012-03-23T12:42:23+00:00 | 43 |
2012-03-23T12:42:24+00:00 | 13 |
2012-03-23T12:42:25+00:00 | 28 |
2012-03-23T12:42:26+00:00 | 2 |
2012-03-23T12:42:27+00:00 | 42 |
2012-03-23T12:42:28+00:00 | 29 |
2012-03-23T12:42:29+00:00 | 36 |
2012-03-23T12:42:30+00:00 | 44 |
2012-03-23T12:42:31+00:00 | 36 |
... | ... |
2012-03-23T12:42:59+00:00 | 8 |
Plotting a simple scatter plot from a DataFrame. Nyaplot integration provides interactivity.
The DataFrame below records the ice cream sales of a particular food chain in a city against the maximum recorded temperature in that city. It also lists the staff strength present in each city.
df = Daru::DataFrame.new({
:temperature => [30.4, 23.5, 44.5, 20.3, 34, 24, 31.45, 28.34, 37, 24],
:sales => [350, 150, 500, 200, 480, 250, 330, 400, 420, 560],
:city => ['Pune', 'Delhi']*5,
:staff => [15,20]*5
})
df
Daru::DataFrame:4800060 rows: 10 cols: 4 | ||||
---|---|---|---|---|
city | sales | staff | temperature | |
0 | Pune | 350 | 15 | 30.4 |
1 | Delhi | 150 | 20 | 23.5 |
2 | Pune | 500 | 15 | 44.5 |
3 | Delhi | 200 | 20 | 20.3 |
4 | Pune | 480 | 15 | 34 |
5 | Delhi | 250 | 20 | 24 |
6 | Pune | 330 | 15 | 31.45 |
7 | Delhi | 400 | 20 | 28.34 |
8 | Pune | 420 | 15 | 37 |
9 | Delhi | 560 | 20 | 24 |
The plot below is between Temperature in the city and the sales of ice cream.
df.plot(type: :scatter, x: :temperature, y: :sales) do |plot, diagram|
plot.x_label "Temperature"
plot.y_label "Sales"
plot.yrange [100, 600]
plot.xrange [15, 50]
diagram.tooltip_contents([:city, :staff])
# Set the color scheme for this diagram.
diagram.color(Nyaplot::Colors.qual)
# Color each point according to the city that it belongs to.
diagram.fill_by(:city)
# Shape each point according to the city that it belongs to.
diagram.shape_by(:city)
end
rng = Distribution::Normal.rng
#<Proc:0x0000000368b250@/home/ubuntu/.rvm/gems/ruby-2.2.1/gems/distribution-0.7.3/lib/distribution/normal/gsl.rb:8 (lambda)>
index = Daru::DateTimeIndex.date_range(:start => '2012-4-2', :periods => 1000)
vector = Daru::Vector.new(1000.times.map {rng.call}, index: index)
vector = vector.cumsum
rolling_mean = vector.rolling_mean 60
GnuplotRB::Plot.new(
[vector , with: 'lines', title: 'Vector'],
[rolling_mean, with: 'lines', title: 'Rolling Mean'],
xlabel: 'Time', ylabel: 'Value'
)
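The plot above compares a cumulative sum of random draws with its 60-point rolling mean. As a rough picture of those two operations, here is a plain-Ruby sketch on a tiny input. Both helpers are written for illustration only, and the choice to leave positions before the first full window as nil is an assumption that may not match daru exactly.

```ruby
# Illustrative helper: running total of the values seen so far.
def cumulative_sum(values)
  total = 0
  values.map { |v| total += v }
end

# Illustrative helper: mean of the last `window` values at each position.
# Assumption: positions without a full window yield nil.
def rolling_mean(values, window)
  values.each_index.map do |i|
    next nil if i < window - 1
    values[(i - window + 1)..i].sum.to_f / window
  end
end

walk = cumulative_sum([1, -2, 3, 4, -1])
p walk                   # => [1, -1, 2, 6, 5]
p rolling_mean(walk, 3)  # two leading nils, then three window means
```

A wider window smooths the series more but leaves more leading positions empty.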
df = Daru::DataFrame.new({
a: [1,2,3,4,5,6]*100,
b: ['a','b','c','d','e','f']*100,
c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df
Daru::DataFrame:5195920 rows: 600 cols: 3 | |||
---|---|---|---|
a | b | c | |
102 | 1 | a | 11 |
177 | 2 | b | 22 |
354 | 3 | c | 33 |
163 | 4 | d | 44 |
230 | 5 | e | 55 |
332 | 6 | f | 66 |
171 | 1 | a | 11 |
123 | 2 | b | 22 |
470 | 3 | c | 33 |
471 | 4 | d | 44 |
309 | 5 | e | 55 |
23 | 6 | f | 66 |
15 | 1 | a | 11 |
26 | 2 | b | 22 |
312 | 3 | c | 33 |
484 | 4 | d | 44 |
386 | 5 | e | 55 |
72 | 6 | f | 66 |
506 | 1 | a | 11 |
96 | 2 | b | 22 |
183 | 3 | c | 33 |
90 | 4 | d | 44 |
451 | 5 | e | 55 |
278 | 6 | f | 66 |
529 | 1 | a | 11 |
87 | 2 | b | 22 |
256 | 3 | c | 33 |
415 | 4 | d | 44 |
421 | 5 | e | 55 |
485 | 6 | f | 66 |
139 | 1 | a | 11 |
482 | 2 | b | 22 |
... | ... | ... | ... |
513 | 6 | f | 66 |
The #where method takes the result of comparing columns against scalar values (here combined with #or) and returns a DataFrame containing only the rows for which the comparison is *true*.
df.where(df[:a].eq(2).or(df[:c].eq(55)))
Daru::DataFrame:14856680 rows: 200 cols: 3 | |||
---|---|---|---|
a | b | c | |
177 | 2 | b | 22 |
230 | 5 | e | 55 |
123 | 2 | b | 22 |
309 | 5 | e | 55 |
26 | 2 | b | 22 |
386 | 5 | e | 55 |
96 | 2 | b | 22 |
451 | 5 | e | 55 |
87 | 2 | b | 22 |
421 | 5 | e | 55 |
482 | 2 | b | 22 |
254 | 5 | e | 55 |
52 | 2 | b | 22 |
282 | 5 | e | 55 |
267 | 2 | b | 22 |
304 | 5 | e | 55 |
36 | 2 | b | 22 |
424 | 5 | e | 55 |
303 | 2 | b | 22 |
353 | 5 | e | 55 |
376 | 2 | b | 22 |
115 | 5 | e | 55 |
55 | 2 | b | 22 |
7 | 5 | e | 55 |
478 | 2 | b | 22 |
239 | 5 | e | 55 |
356 | 2 | b | 22 |
530 | 5 | e | 55 |
99 | 2 | b | 22 |
81 | 5 | e | 55 |
595 | 2 | b | 22 |
436 | 5 | e | 55 |
... | ... | ... | ... |
532 | 5 | e | 55 |
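The query above can be read as building one true/false flag per row and keeping the rows whose flag is true. The plain-Ruby sketch below does the same over parallel arrays and arrives at the same 200 rows. `where_rows` is a hypothetical helper, not daru's implementation, and the row labels are left out for brevity.

```ruby
# Hypothetical helper (not part of daru): keep the rows whose flag in
# `mask` is true.
def where_rows(rows, mask)
  rows.select.with_index { |_row, i| mask[i] }
end

a = [1, 2, 3, 4, 5, 6] * 100
b = %w[a b c d e f] * 100
c = [11, 22, 33, 44, 55, 66] * 100

a_is_2  = a.map { |v| v == 2 }                    # like df[:a].eq(2)
c_is_55 = c.map { |v| v == 55 }                   # like df[:c].eq(55)
mask    = a_is_2.zip(c_is_55).map { |x, y| x || y }   # element-wise "or"

selected = where_rows(a.zip(b, c), mask)
p selected.size      # => 200
p selected.first(2)  # => [[2, "b", 22], [5, "e", 55]]
```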