This notebook describes the time series functionality of daru. We'll go through some examples of creating and interacting with a time series and also see the functionality that is offered by the specialized index that deals with time series data, called DateTimeIndex. A few functions that are particularly useful for analyzing time-based data will also be demoed.
At the end we'll see how a time series can be visualized using the excellent GNU plot gem.
require 'daru'
require 'awesome_print'
true
For a Daru::Vector or DataFrame to qualify as a time series, it must be indexed using the Daru::DateTimeIndex class. A DateTimeIndex can be created by using the .date_range function or by calling the class constructor directly.
The DateTimeIndex.date_range function accepts the :start, :end, :freq and :periods options, all of which are used in the examples that follow.
If you specify the :start and :end options as strings, they can be complete or partial dates, and daru will infer the date from the string directly. However, note that the date-like string must be in the format YYYY-MM-DD HH:MM:SS. Currently the precision of DateTimeIndex is up to seconds only, though this will improve in the future.
# In the code below we will create a DateTimeIndex starting from 2012-4-4 to 2012-4-19
# with a daily frequency. The 'D' supplied to the :freq argument specifies that frequency
# has to be daily. It can be any of the supported string offset aliases. See
# the section below for a complete overview of date offsets.
index = Daru::DateTimeIndex.date_range(:start => '2012-4-4', :end => '2012-4-19', :freq => 'D')
#<DateTimeIndex:20856340 offset=D periods=16 data=[2012-04-04T00:00:00+00:00...2012-04-19T00:00:00+00:00]>
As you can see above, .date_range has created a DateTimeIndex with 16 dates (or periods) with a daily frequency between each date.
Converting this index to an Array shows that this is true:
ap index.to_a
nil
[
    [ 0] #<DateTime: 2012-04-04T00:00:00+00:00 ((2456022j,0s,0n),+0s,2299161j)>,
    [ 1] #<DateTime: 2012-04-05T00:00:00+00:00 ((2456023j,0s,0n),+0s,2299161j)>,
    [ 2] #<DateTime: 2012-04-06T00:00:00+00:00 ((2456024j,0s,0n),+0s,2299161j)>,
    [ 3] #<DateTime: 2012-04-07T00:00:00+00:00 ((2456025j,0s,0n),+0s,2299161j)>,
    [ 4] #<DateTime: 2012-04-08T00:00:00+00:00 ((2456026j,0s,0n),+0s,2299161j)>,
    [ 5] #<DateTime: 2012-04-09T00:00:00+00:00 ((2456027j,0s,0n),+0s,2299161j)>,
    [ 6] #<DateTime: 2012-04-10T00:00:00+00:00 ((2456028j,0s,0n),+0s,2299161j)>,
    [ 7] #<DateTime: 2012-04-11T00:00:00+00:00 ((2456029j,0s,0n),+0s,2299161j)>,
    [ 8] #<DateTime: 2012-04-12T00:00:00+00:00 ((2456030j,0s,0n),+0s,2299161j)>,
    [ 9] #<DateTime: 2012-04-13T00:00:00+00:00 ((2456031j,0s,0n),+0s,2299161j)>,
    [10] #<DateTime: 2012-04-14T00:00:00+00:00 ((2456032j,0s,0n),+0s,2299161j)>,
    [11] #<DateTime: 2012-04-15T00:00:00+00:00 ((2456033j,0s,0n),+0s,2299161j)>,
    [12] #<DateTime: 2012-04-16T00:00:00+00:00 ((2456034j,0s,0n),+0s,2299161j)>,
    [13] #<DateTime: 2012-04-17T00:00:00+00:00 ((2456035j,0s,0n),+0s,2299161j)>,
    [14] #<DateTime: 2012-04-18T00:00:00+00:00 ((2456036j,0s,0n),+0s,2299161j)>,
    [15] #<DateTime: 2012-04-19T00:00:00+00:00 ((2456037j,0s,0n),+0s,2299161j)>
]
Specifying a number before the date alias in the :freq option will set the frequency to a multiple of the offset. Using this, you can create date ranges whose frequency is any multiple of a supported offset.
# The following code will create a range between 2014-5-1 00:00:00 and 2014-5-2 00:00:00,
# with a difference of 6 hours between each date.
index = Daru::DateTimeIndex.date_range(:start => DateTime.new(2014,5,1), :end => DateTime.new(2014,5,2), :freq => '6H')
ap index.to_a; nil
[
    [0] #<DateTime: 2014-05-01T00:00:00+00:00 ((2456779j,0s,0n),+0s,2299161j)>,
    [1] #<DateTime: 2014-05-01T06:00:00+00:00 ((2456779j,21600s,0n),+0s,2299161j)>,
    [2] #<DateTime: 2014-05-01T12:00:00+00:00 ((2456779j,43200s,0n),+0s,2299161j)>,
    [3] #<DateTime: 2014-05-01T18:00:00+00:00 ((2456779j,64800s,0n),+0s,2299161j)>,
    [4] #<DateTime: 2014-05-02T00:00:00+00:00 ((2456780j,0s,0n),+0s,2299161j)>
]
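To make the idea of a multiplied offset concrete, here is a plain-Ruby sketch that uses only the standard library. It is not how daru implements date_range; the helper name step_range is hypothetical and exists only for illustration. It walks from a start to an end DateTime in fixed 6-hour steps, which is the same spacing the '6H' alias asks for.

```ruby
require 'date'

# Hypothetical helper (not part of daru): collect every DateTime from
# `start` to `finish` inclusive, advancing by a fixed number of hours.
def step_range(start, finish, hours)
  step = Rational(hours, 24) # DateTime arithmetic is expressed in days
  dates = []
  current = start
  while current <= finish
    dates << current
    current += step
  end
  dates
end

six_hourly = step_range(DateTime.new(2014, 5, 1), DateTime.new(2014, 5, 2), 6)
six_hourly.each { |d| puts d.iso8601 }
```

Using a Rational for the step keeps the arithmetic exact, so the last element lands precisely on the end date instead of drifting by a fraction of a second.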
The frequency strings that you just saw are translated under the hood into objects of type Daru::Offsets. These offsets determine the distance by which the dates are shifted. See this blog post for a detailed coverage of date offsets and their string aliases.
:freq can also accept a Daru::DateOffset object or any of the objects under the namespace of Daru::Offsets. For example, to create a date range that has a frequency of 6 seconds:
offset = Daru::Offsets::Second.new(6)
index = Daru::DateTimeIndex.date_range(:start => '2012-5-6', :end => '2012-5-6 20:00:00', :freq => offset)
#<DateTimeIndex:18324260 offset=6S periods=12001 data=[2012-05-06T00:00:00+00:00...2012-05-06T20:00:00+00:00]>
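The period count reported above can be checked by hand. The following is a small arithmetic sketch in plain Ruby (standard library only, no daru calls): a 20-hour span divided into 6-second steps, plus one for the starting point.

```ruby
require 'date'

start  = DateTime.new(2012, 5, 6)
finish = DateTime.new(2012, 5, 6, 20)

# Subtracting two DateTime objects gives the difference in days,
# so multiply by 86,400 to get seconds.
span_seconds = ((finish - start) * 86_400).to_i

# One period every 6 seconds, plus the starting point itself.
periods = span_seconds / 6 + 1
puts periods # 12001
```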
Another way to specify the range of the date index is to use the :periods option. This option decides exactly how many index objects will go into the DateTimeIndex, and takes precedence over whatever date is specified in the :end option.
So, to create an index of 50 periods starting from the date '2012-5-2' with a month-end frequency:
index = Daru::DateTimeIndex.date_range(:start => '2012-5-2', :periods => 50, :freq => 'ME')
#<DateTimeIndex:14838220 offset=ME periods=50 data=[2012-05-31T00:00:00+00:00...2016-06-30T00:00:00+00:00]>
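If you'd like to convince yourself of where the first and last dates come from, here is a plain-Ruby sketch of month-end stepping. It uses only the standard library and is not daru's implementation; the helper name month_ends is hypothetical.

```ruby
require 'date'

# Hypothetical helper (not part of daru): the first `periods` month-end
# dates on or after `start`.
def month_ends(start, periods)
  first_of_month = Date.new(start.year, start.month, 1)
  Array.new(periods) do |i|
    month = first_of_month >> i           # advance i calendar months
    Date.new(month.year, month.month, -1) # day -1 means "last day of the month"
  end
end

ends = month_ends(Date.new(2012, 5, 2), 50)
puts ends.first # 2012-05-31
puts ends.last  # 2016-06-30
```

The 50th month counting from May 2012 is June 2016, which matches the last date in the index shown above.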
You can ask for the frequency of an index with the #frequency method.
index.frequency
"ME"
The DateTimeIndex constructor allows you to create a DateTimeIndex even if the dates are not separated by a particular frequency.
index = Daru::DateTimeIndex.new(
[DateTime.new(2012,4,5), DateTime.new(2012,4,6), DateTime.new(2012,4,7), DateTime.new(2012,4,8)])
#<DateTimeIndex:13228320 offset=nil periods=4 data=[2012-04-05T00:00:00+00:00...2012-04-08T00:00:00+00:00]>
The constructor also accepts an optional :freq option that allows you to pass either a frequency string alias or an offset object. If you want daru to infer the frequency of your data, pass :infer as the value of :freq and it will try to figure out the frequency by itself (if a frequency cannot be inferred, it will be set to nil).
index = Daru::DateTimeIndex.new([
DateTime.new(2012,4,5), DateTime.new(2012,4,6), DateTime.new(2012,4,7), DateTime.new(2012,4,8),
DateTime.new(2012,4,9), DateTime.new(2012,4,10), DateTime.new(2012,4,11), DateTime.new(2012,4,12)
], freq: :infer)
#<DateTimeIndex:13119080 offset=D periods=8 data=[2012-04-05T00:00:00+00:00...2012-04-12T00:00:00+00:00]>
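To give a feel for what inference involves, here is a deliberately simplified plain-Ruby sketch. It only recognizes a daily gap and returns nil otherwise; it is an illustration under that assumption, not daru's actual inference logic, and the helper name infer_daily is hypothetical.

```ruby
require 'date'

# Hypothetical helper (not part of daru): return 'D' when every consecutive
# pair of dates is exactly one day apart, otherwise nil.
def infer_daily(dates)
  return nil if dates.size < 2

  dates.each_cons(2).all? { |a, b| b - a == 1 } ? 'D' : nil
end

regular   = (5..12).map { |day| DateTime.new(2012, 4, day) }
irregular = [DateTime.new(2012, 4, 5), DateTime.new(2012, 4, 6), DateTime.new(2012, 4, 9)]

p infer_daily(regular)   # "D"
p infer_daily(irregular) # nil
```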
The DateTimeIndex offers a host of methods for manipulating and knowing more about the data contained in the index. Let us consider a sample DateTimeIndex and demonstrate:
index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 10, :freq => 'YEAR')
#<DateTimeIndex:11607200 offset=YEAR periods=10 data=[2012-01-01T00:00:00+00:00...2021-01-01T00:00:00+00:00]>
You can get a Ruby Array of the year that each entry in the index belongs to with the #year method:
index.year
[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
Similarly, you can query for #month, #day, #hour, #min or #sec using the respective methods.
To move all the data points of a DateTimeIndex to the future, use the #shift method; to move all of them to the past, use the #lag method.
Passing an offset to #shift will offset each data point by the offset value:
index.shift(Daru::Offsets::Hour.new(3))
#<DateTimeIndex:12276600 offset=nil periods=10 data=[2012-01-01T03:00:00+00:00...2021-01-01T03:00:00+00:00]>
Passing a positive integer into #shift will offset each data point by the same offset that it was created with:
index.shift(4) # Shift by 4 years
#<DateTimeIndex:12095520 offset=YEAR periods=10 data=[2016-01-01T00:00:00+00:00...2025-01-01T00:00:00+00:00]>
#lag works in a similar manner:
index.lag(Daru::DateOffset.new(days: 4))
#<DateTimeIndex:11703700 offset=nil periods=10 data=[2011-12-28T00:00:00+00:00...2020-12-28T00:00:00+00:00]>
index.lag(2)
#<DateTimeIndex:10628980 offset=YEAR periods=10 data=[2010-01-01T00:00:00+00:00...2019-01-01T00:00:00+00:00]>
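The date arithmetic behind these four results can be reproduced with plain Ruby. This is a sketch using only the standard library, shown to make the shifted boundaries easy to verify; it does not call daru and is not a description of daru's internals.

```ruby
require 'date'

# The same ten yearly dates as the index above.
years = (2012..2021).map { |y| DateTime.new(y, 1, 1) }

shifted_hours = years.map { |d| d + Rational(3, 24) } # forward by 3 hours
shifted_years = years.map { |d| d >> (12 * 4) }       # forward by 4 years
lagged_days   = years.map { |d| d - 4 }               # back by 4 days
lagged_years  = years.map { |d| d << (12 * 2) }       # back by 2 years

puts shifted_hours.first.iso8601 # 2012-01-01T03:00:00+00:00
puts shifted_years.last.iso8601  # 2025-01-01T00:00:00+00:00
puts lagged_days.first.iso8601   # 2011-12-28T00:00:00+00:00
puts lagged_years.first.iso8601  # 2010-01-01T00:00:00+00:00
```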
When used with Daru::Vector or Daru::DataFrame, DateTimeIndex functions exactly like any other index. You can query individual dates, slices, etc. and retrieve the relevant data by specifying the date either completely or partially.
One of the salient features of indexing time-based data with the DateTimeIndex is that it lets you retrieve the data of a given time period by specifying just a partial date. We'll see how exactly this can be done with some examples.
For starters, let's create a basic Daru::Vector that is indexed on a DateTimeIndex.
index = Daru::DateTimeIndex.date_range(:start => '2012-3-4', :periods => 50000, :freq => 'H')
vector = Daru::Vector.new([1,2,3,4,5]*10000, index: index)
vector.head
Daru::Vector:36599420 size: 10 | |
---|---|
nil | |
2012-03-04T00:00:00+00:00 | 1 |
2012-03-04T01:00:00+00:00 | 2 |
2012-03-04T02:00:00+00:00 | 3 |
2012-03-04T03:00:00+00:00 | 4 |
2012-03-04T04:00:00+00:00 | 5 |
2012-03-04T05:00:00+00:00 | 1 |
2012-03-04T06:00:00+00:00 | 2 |
2012-03-04T07:00:00+00:00 | 3 |
2012-03-04T08:00:00+00:00 | 4 |
2012-03-04T09:00:00+00:00 | 5 |
You can retrieve data by specifying the date completely or partially. Specifying it partially will retrieve all the data that falls under that time period. For example, to retrieve all the data that falls under April 2012 (that's '2012-4'):
vector['2012-4']
Daru::Vector:37106300 size: 720 | |
---|---|
nil | |
2012-04-01T00:00:00+00:00 | 3 |
2012-04-01T01:00:00+00:00 | 4 |
2012-04-01T02:00:00+00:00 | 5 |
2012-04-01T03:00:00+00:00 | 1 |
2012-04-01T04:00:00+00:00 | 2 |
2012-04-01T05:00:00+00:00 | 3 |
2012-04-01T06:00:00+00:00 | 4 |
2012-04-01T07:00:00+00:00 | 5 |
2012-04-01T08:00:00+00:00 | 1 |
2012-04-01T09:00:00+00:00 | 2 |
2012-04-01T10:00:00+00:00 | 3 |
2012-04-01T11:00:00+00:00 | 4 |
2012-04-01T12:00:00+00:00 | 5 |
2012-04-01T13:00:00+00:00 | 1 |
2012-04-01T14:00:00+00:00 | 2 |
2012-04-01T15:00:00+00:00 | 3 |
2012-04-01T16:00:00+00:00 | 4 |
2012-04-01T17:00:00+00:00 | 5 |
2012-04-01T18:00:00+00:00 | 1 |
2012-04-01T19:00:00+00:00 | 2 |
2012-04-01T20:00:00+00:00 | 3 |
2012-04-01T21:00:00+00:00 | 4 |
2012-04-01T22:00:00+00:00 | 5 |
2012-04-01T23:00:00+00:00 | 1 |
2012-04-02T00:00:00+00:00 | 2 |
2012-04-02T01:00:00+00:00 | 3 |
2012-04-02T02:00:00+00:00 | 4 |
2012-04-02T03:00:00+00:00 | 5 |
2012-04-02T04:00:00+00:00 | 1 |
2012-04-02T05:00:00+00:00 | 2 |
2012-04-02T06:00:00+00:00 | 3 |
2012-04-02T07:00:00+00:00 | 4 |
... | ... |
2012-04-30T23:00:00+00:00 | 2 |
As you can see, only the data with an index in April 2012 was retrieved.
Now, say you want all the data under the year 2013. You can just specify the year as a string:
vector['2013']
Daru::Vector:38535340 size: 8760 | |
---|---|
nil | |
2013-01-01T00:00:00+00:00 | 3 |
2013-01-01T01:00:00+00:00 | 4 |
2013-01-01T02:00:00+00:00 | 5 |
2013-01-01T03:00:00+00:00 | 1 |
2013-01-01T04:00:00+00:00 | 2 |
2013-01-01T05:00:00+00:00 | 3 |
2013-01-01T06:00:00+00:00 | 4 |
2013-01-01T07:00:00+00:00 | 5 |
2013-01-01T08:00:00+00:00 | 1 |
2013-01-01T09:00:00+00:00 | 2 |
2013-01-01T10:00:00+00:00 | 3 |
2013-01-01T11:00:00+00:00 | 4 |
2013-01-01T12:00:00+00:00 | 5 |
2013-01-01T13:00:00+00:00 | 1 |
2013-01-01T14:00:00+00:00 | 2 |
2013-01-01T15:00:00+00:00 | 3 |
2013-01-01T16:00:00+00:00 | 4 |
2013-01-01T17:00:00+00:00 | 5 |
2013-01-01T18:00:00+00:00 | 1 |
2013-01-01T19:00:00+00:00 | 2 |
2013-01-01T20:00:00+00:00 | 3 |
2013-01-01T21:00:00+00:00 | 4 |
2013-01-01T22:00:00+00:00 | 5 |
2013-01-01T23:00:00+00:00 | 1 |
2013-01-02T00:00:00+00:00 | 2 |
2013-01-02T01:00:00+00:00 | 3 |
2013-01-02T02:00:00+00:00 | 4 |
2013-01-02T03:00:00+00:00 | 5 |
2013-01-02T04:00:00+00:00 | 1 |
2013-01-02T05:00:00+00:00 | 2 |
2013-01-02T06:00:00+00:00 | 3 |
2013-01-02T07:00:00+00:00 | 4 |
... | ... |
2013-12-31T23:00:00+00:00 | 2 |
Passing a string to #[] evaluates it to the greatest possible accuracy and then retrieves the relevant data. Now say you want the data that happens to be on 4th February 2013. Just specify this as a string:
vector['2013-2-4']
Daru::Vector:40565680 size: 24 | |
---|---|
nil | |
2013-02-04T00:00:00+00:00 | 4 |
2013-02-04T01:00:00+00:00 | 5 |
2013-02-04T02:00:00+00:00 | 1 |
2013-02-04T03:00:00+00:00 | 2 |
2013-02-04T04:00:00+00:00 | 3 |
2013-02-04T05:00:00+00:00 | 4 |
2013-02-04T06:00:00+00:00 | 5 |
2013-02-04T07:00:00+00:00 | 1 |
2013-02-04T08:00:00+00:00 | 2 |
2013-02-04T09:00:00+00:00 | 3 |
2013-02-04T10:00:00+00:00 | 4 |
2013-02-04T11:00:00+00:00 | 5 |
2013-02-04T12:00:00+00:00 | 1 |
2013-02-04T13:00:00+00:00 | 2 |
2013-02-04T14:00:00+00:00 | 3 |
2013-02-04T15:00:00+00:00 | 4 |
2013-02-04T16:00:00+00:00 | 5 |
2013-02-04T17:00:00+00:00 | 1 |
2013-02-04T18:00:00+00:00 | 2 |
2013-02-04T19:00:00+00:00 | 3 |
2013-02-04T20:00:00+00:00 | 4 |
2013-02-04T21:00:00+00:00 | 5 |
2013-02-04T22:00:00+00:00 | 1 |
2013-02-04T23:00:00+00:00 | 2 |
Specifying the date up to the hour will return precisely that one data point, because the index has an hourly frequency.
vector['2013-2-4 22']
1
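One way to think about partial-date lookup is that each partial string stands for a span of time, and every index entry inside that span is returned. The plain-Ruby sketch below illustrates that idea with the standard library only. It is an assumption about the concept, not daru's implementation; the helper name partial_date_range is hypothetical, and it only understands the year, month, day and hour forms used in this notebook.

```ruby
require 'date'

# Hypothetical helper (not part of daru): turn a partial date string such as
# '2013', '2012-4', '2013-2-4' or '2013-2-4 22' into a half-open range of
# DateTime objects covering that period.
def partial_date_range(str)
  date_part, hour_part = str.split(' ')
  year, month, day = date_part.split('-').map(&:to_i)

  if hour_part
    start = DateTime.new(year, month, day, hour_part.to_i)
    start...(start + Rational(1, 24)) # one hour
  elsif day
    start = DateTime.new(year, month, day)
    start...(start + 1)               # one day
  elsif month
    start = DateTime.new(year, month, 1)
    start...(start >> 1)              # one calendar month
  else
    start = DateTime.new(year, 1, 1)
    start...(start >> 12)             # one calendar year
  end
end

# Hourly timestamps mirroring the 50,000-period index used above.
first_hour = DateTime.new(2012, 3, 4)
hours = Array.new(50_000) { |i| first_hour + Rational(i, 24) }

april = partial_date_range('2012-4')
puts hours.count { |t| april.cover?(t) } # 720
```

The counts this produces for '2012-4', '2013' and '2013-2-4' line up with the vector sizes shown in the outputs above (720, 8760 and 24).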
To specify a date precisely, it is even possible to pass a DateTime object into #[]:
vector[DateTime.new(2012,5,1)]
3
DateTimeIndex can be used with DataFrame the way it was used with Vector. We can index both rows and columns of a DataFrame using a DateTimeIndex:
index = Daru::DateTimeIndex.date_range(:start => '2012-4-5', :periods => 50, :freq => 'D')
df = Daru::DataFrame.new({
a: [1,2,3,4,5]*10,
b: ['a','b','c','d','e']*10,
c: ['foo', 'bar','baz','razz','jazz']*10
}, index: index)
Daru::DataFrame:40773500 rows: 50 cols: 3 | |||
---|---|---|---|
a | b | c | |
2012-04-05T00:00:00+00:00 | 1 | a | foo |
2012-04-06T00:00:00+00:00 | 2 | b | bar |
2012-04-07T00:00:00+00:00 | 3 | c | baz |
2012-04-08T00:00:00+00:00 | 4 | d | razz |
2012-04-09T00:00:00+00:00 | 5 | e | jazz |
2012-04-10T00:00:00+00:00 | 1 | a | foo |
2012-04-11T00:00:00+00:00 | 2 | b | bar |
2012-04-12T00:00:00+00:00 | 3 | c | baz |
2012-04-13T00:00:00+00:00 | 4 | d | razz |
2012-04-14T00:00:00+00:00 | 5 | e | jazz |
2012-04-15T00:00:00+00:00 | 1 | a | foo |
2012-04-16T00:00:00+00:00 | 2 | b | bar |
2012-04-17T00:00:00+00:00 | 3 | c | baz |
2012-04-18T00:00:00+00:00 | 4 | d | razz |
2012-04-19T00:00:00+00:00 | 5 | e | jazz |
2012-04-20T00:00:00+00:00 | 1 | a | foo |
2012-04-21T00:00:00+00:00 | 2 | b | bar |
2012-04-22T00:00:00+00:00 | 3 | c | baz |
2012-04-23T00:00:00+00:00 | 4 | d | razz |
2012-04-24T00:00:00+00:00 | 5 | e | jazz |
2012-04-25T00:00:00+00:00 | 1 | a | foo |
2012-04-26T00:00:00+00:00 | 2 | b | bar |
2012-04-27T00:00:00+00:00 | 3 | c | baz |
2012-04-28T00:00:00+00:00 | 4 | d | razz |
2012-04-29T00:00:00+00:00 | 5 | e | jazz |
2012-04-30T00:00:00+00:00 | 1 | a | foo |
2012-05-01T00:00:00+00:00 | 2 | b | bar |
2012-05-02T00:00:00+00:00 | 3 | c | baz |
2012-05-03T00:00:00+00:00 | 4 | d | razz |
2012-05-04T00:00:00+00:00 | 5 | e | jazz |
2012-05-05T00:00:00+00:00 | 1 | a | foo |
2012-05-06T00:00:00+00:00 | 2 | b | bar |
... | ... | ... | ... |
2012-05-24T00:00:00+00:00 | 5 | e | jazz |
Rows can be retrieved using a syntax similar to that of Daru::Vector:
df.row['2012-5']
Daru::DataFrame:40777240 rows: 24 cols: 3 | |||
---|---|---|---|
a | b | c | |
2012-05-01T00:00:00+00:00 | 2 | b | bar |
2012-05-02T00:00:00+00:00 | 3 | c | baz |
2012-05-03T00:00:00+00:00 | 4 | d | razz |
2012-05-04T00:00:00+00:00 | 5 | e | jazz |
2012-05-05T00:00:00+00:00 | 1 | a | foo |
2012-05-06T00:00:00+00:00 | 2 | b | bar |
2012-05-07T00:00:00+00:00 | 3 | c | baz |
2012-05-08T00:00:00+00:00 | 4 | d | razz |
2012-05-09T00:00:00+00:00 | 5 | e | jazz |
2012-05-10T00:00:00+00:00 | 1 | a | foo |
2012-05-11T00:00:00+00:00 | 2 | b | bar |
2012-05-12T00:00:00+00:00 | 3 | c | baz |
2012-05-13T00:00:00+00:00 | 4 | d | razz |
2012-05-14T00:00:00+00:00 | 5 | e | jazz |
2012-05-15T00:00:00+00:00 | 1 | a | foo |
2012-05-16T00:00:00+00:00 | 2 | b | bar |
2012-05-17T00:00:00+00:00 | 3 | c | baz |
2012-05-18T00:00:00+00:00 | 4 | d | razz |
2012-05-19T00:00:00+00:00 | 5 | e | jazz |
2012-05-20T00:00:00+00:00 | 1 | a | foo |
2012-05-21T00:00:00+00:00 | 2 | b | bar |
2012-05-22T00:00:00+00:00 | 3 | c | baz |
2012-05-23T00:00:00+00:00 | 4 | d | razz |
2012-05-24T00:00:00+00:00 | 5 | e | jazz |
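The row count above is easy to sanity-check without daru. The following plain-Ruby sketch (standard library only, with an array of hashes standing in for the DataFrame as an illustrative assumption) builds the same 50 daily rows and keeps those that fall in May 2012.

```ruby
require 'date'

start = Date.new(2012, 4, 5)

# Fifty daily rows with the same repeating column values as the DataFrame above.
rows = Array.new(50) do |i|
  {
    date: start + i,
    a: [1, 2, 3, 4, 5][i % 5],
    b: %w[a b c d e][i % 5],
    c: %w[foo bar baz razz jazz][i % 5]
  }
end

may_rows = rows.select { |r| r[:date].year == 2012 && r[:date].month == 5 }
puts may_rows.size # 24
```

The index starts on 5 April, so 26 of the 50 days fall in April and the remaining 24 fall in May, which is why the selection stops at 24 May.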