#-2 | Rolling a Join
The data.table package in R facilitates rolling joins. I’ve stumbled across the term a few times so I thought I should learn about it. My first Google search for rolling join
yielded a mix of data topics…and other content that wasn’t work safe. I carefully navigated to a few different pages and did some reading. I didn’t understand what was going on though. So I created some toy examples of my own and with some tinkering I got to grips with the technique.
Nomenclature
According to the data.table intro vignette a rolling join is also called a fast ordered join or last observation carried forward join. The basic idea is that if there is no matching value for a row being looked up during the join, then the value from the closest record is matched in lieu. I guess each of the names kind of makes sense to me but I’m sticking with rolling join because otherwise the title of this post wouldn’t work.
Making up data
To start with I made up a couple of pretend data.tables to join together.
The id
column can be any arbitrary entity but here I’ve just used 1 and 2. The column t
represents time, or something similar to that, and again I’ve just picked arbitrary values. id
is set as the first key column and t
is set as the second key column - the order will matter later.
Say we want to add another column, the colour associated with an id
at time t
.
So id
1 was associated with ‘blue’ at time 300 and id
2 was associated with ‘yellow’ at time 200 and ‘red’ at time 600. The keys are set the same as dt
.
Let’s join the two data.tables together to add the colours. So for each entry in dt
we want to try and find any matching values in dt.colour
. In data.table land this is done like this
As expected the colour column has been added with values if possible and with NAs inserted where there is no match.
Rolling
Let’s set roll=Inf
and see what happens
For rows that have no exact key match, the last value matching the first key is rolled forward. The last value is defined by the order of the final key column - in this case t
. The roll
argument is set to Inf
to specify that there is no limit to how far a value can be rolled forward.
From here on I’m going to add the unrolled version of the colour column to dt
so as the rolled result can be easily compared to non-rolled result.
Equivalent arguments
Note that roll=TRUE
and roll=T
are equivalent to roll=Inf
.
Rolling backwards
Rolling can be reversed by setting roll=-Inf
. Now values are rolled back instead of forward.
Limiting the roll
By setting roll to a number you can limit how far values are rolled forward or back. The value specified is relative to the second key column, in this case t
.
Since t increases in increments of 100 for each id, the colour is rolled forward one row. So with these datatables if roll=99
then nothing will be rolled forward and the result is the same as roll=FALSE
.
What happens if roll isn’t an integer? It seems to work as you would expect.
Here’s an example with a higher value of roll
.
Note again that when you specify a number for the roll
argument it is how far a particular value can be rolled forward or back. It is worth pointing out that it can also be thought of as how far back a row missing a value is willing to look to find an alternative. This means you can use it to, say, find the most recent value within a certain window.
Selective Roll
I couldn’t figure out how to roll on some columns but not others. I don’t think this is currently possible.
Nearest Value
You can also set roll=nearest
. Let’s see what this does.
It looks like the nearest value, either forwards or backwards, is taken. You can see that there is a tie when id=2
and t=400
and in this case the value found by looking backwards seems to be taken over the one found by looking forwards. But what if you want to look only so far backwards and forwards?
Rollends
There is also the rollends
argument which by default is set to c(TRUE, FALSE)
if roll
is negative and c(FALSE, TRUE)
if it is positive or zero (I suppose roll=TRUE
is considered positive). So the first element of rollends
is whether to roll backwards and the second element is whether to roll forwards. So by setting rollends=c(TRUE, TRUE)
you can roll backwards and forwards whilst also specifying how far to look.
What happens when you set rollends=c(FALSE, FALSE)
. I expected this would mean no rolling, but that doesn’t seem to be the case. The documentation says
If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column.
First note that rollends=FALSE
is the same as rollends=c(FALSE, FALSE)
because the value is recycled. Next, in our case the x
referred to is dt.colours
and i
is dt
. So in order for a row in dt
to find a match, its value of t
must fall in between the values of t
for two rows in dt.colours
that have the same value of id
, the first key column. Let’s try it with roll=Inf
. It might be a good idea to print out dt.colours
as well for reference.
Only rows of dt
that have a value of t
in between values of t
in dt.colour can roll previous values
forward. So when id
is 1 there is no rolling since there is only one applicable row in dt.colours
meaning nothing can be in between anything. However for dt
equal 2 the rows with values of t
from 200-500 are all in between the two entries with an id
of 2 in dt.colour
. So it seems that rollends=FALSE
is a way of turning off rolling on the edges of the data.
Multiple Grouping Columns
Everything used in this post has used two key columns id
and t
when performing rolling joins. However you can use as many as you like with the understanding that the last key column will always be used for rolling and the rest for defining groups. As a finish here’s an example of this.
Corrections
2017-07-02: Updated the Rollends section to state that the default value of rollends
when roll
is positive or zero is c(FALSE, TRUE)
rather than c(TRUE, FALSE)
.