
#-2 | Rolling a Join
The data.table package in R facilitates rolling joins. I’ve stumbled across the term a few times so I thought I should learn about it. My first Google search for rolling join
yielded a mix of data topics…and other content that wasn’t work safe. I carefully navigated to a few different pages and did some reading. I didn’t understand what was going on though. So I created some toy examples of my own and with some tinkering I got to grips with the technique.
Nomenclature
According to the data.table intro vignette a rolling join is also called a fast ordered join or last observation carried forward join. The basic idea is that if there is no matching value for a row being looked up during the join, then the value from the closest record is matched in lieu. I guess each of the names kind of makes sense to me but I’m sticking with rolling join because otherwise the title of this post wouldn’t work.
Making up data
To start with I made up a couple of pretend data.tables to join together.
dt <- data.table(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
t=c(100, 200, 300, 400, 500, 600, 700, 800, 100, 200, 300, 400, 500, 600, 700, 800),
key=c('id', 't'))
dt
## id t
## 1: 1 100
## 2: 1 200
## 3: 1 300
## 4: 1 400
## 5: 1 500
## 6: 1 600
## 7: 1 700
## 8: 1 800
## 9: 2 100
## 10: 2 200
## 11: 2 300
## 12: 2 400
## 13: 2 500
## 14: 2 600
## 15: 2 700
## 16: 2 800
The id
column can be any arbitrary entity but here I’ve just used 1 and 2. The column t
represents time, or something similar to that, and again I’ve just picked arbitrary values. id
is set as the first key column and t
is set as the second key column - the order will matter later.
Say we want to add another column, the colour associated with an id
at time t
.
dt.colour <- data.table(id=c(1, 2, 2),
t=c(300, 200, 600),
colour=c('blue', 'yellow', 'red'), key=c('id', 't'))
dt.colour
## id t colour
## 1: 1 300 blue
## 2: 2 200 yellow
## 3: 2 600 red
So id
1 was associated with ‘blue’ at time 300 and id
2 was associated with ‘yellow’ at time 200 and ‘red’ at time 600. The keys are set the same as dt
.
Let’s join the two data.tables together to add the colours. So for each entry in dt
we want to try and find any matching values in dt.colour
. In data.table land this is done like this
dt.colour[dt,]
## id t colour
## 1: 1 100 NA
## 2: 1 200 NA
## 3: 1 300 blue
## 4: 1 400 NA
## 5: 1 500 NA
## 6: 1 600 NA
## 7: 1 700 NA
## 8: 1 800 NA
## 9: 2 100 NA
## 10: 2 200 yellow
## 11: 2 300 NA
## 12: 2 400 NA
## 13: 2 500 NA
## 14: 2 600 red
## 15: 2 700 NA
## 16: 2 800 NA
As expected the colour column has been added with values if possible and with NAs inserted where there is no match.
Rolling
Let’s set roll=Inf
and see what happens
dt.colour[dt, , roll=Inf]
## id t colour
## 1: 1 100 NA
## 2: 1 200 NA
## 3: 1 300 blue
## 4: 1 400 blue
## 5: 1 500 blue
## 6: 1 600 blue
## 7: 1 700 blue
## 8: 1 800 blue
## 9: 2 100 NA
## 10: 2 200 yellow
## 11: 2 300 yellow
## 12: 2 400 yellow
## 13: 2 500 yellow
## 14: 2 600 red
## 15: 2 700 red
## 16: 2 800 red
For rows that have no exact key match, the last value matching the first key is rolled forward. The last value is defined by the order of the final key column - in this case t
. The roll
argument is set to Inf
to specify that there is no limit to how far a value can be rolled forward.
From here on I’m going to add the unrolled version of the colour column to dt
so as the rolled result can be easily compared to non-rolled result.
dt[, non.rolled.colour := dt.colour[dt, colour, roll=FALSE]]
dt
## id t non.rolled.colour
## 1: 1 100 NA
## 2: 1 200 NA
## 3: 1 300 blue
## 4: 1 400 NA
## 5: 1 500 NA
## 6: 1 600 NA
## 7: 1 700 NA
## 8: 1 800 NA
## 9: 2 100 NA
## 10: 2 200 yellow
## 11: 2 300 NA
## 12: 2 400 NA
## 13: 2 500 NA
## 14: 2 600 red
## 15: 2 700 NA
## 16: 2 800 NA
Equivalent arguments
Note that roll=TRUE
and roll=T
are equivalent to roll=Inf
.
Rolling backwards
Rolling can be reversed by setting roll=-Inf
. Now values are rolled back instead of forward.
dt.colour[dt, , roll=-Inf]
## id t colour non.rolled.colour
## 1: 1 100 blue NA
## 2: 1 200 blue NA
## 3: 1 300 blue blue
## 4: 1 400 NA NA
## 5: 1 500 NA NA
## 6: 1 600 NA NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 yellow NA
## 10: 2 200 yellow yellow
## 11: 2 300 red NA
## 12: 2 400 red NA
## 13: 2 500 red NA
## 14: 2 600 red red
## 15: 2 700 NA NA
## 16: 2 800 NA NA
Limiting the roll
By setting roll to a number you can limit how far values are rolled forward or back. The value specified is relative to the second key column, in this case t
.
dt.colour[dt, , roll=100]
## id t colour non.rolled.colour
## 1: 1 100 NA NA
## 2: 1 200 NA NA
## 3: 1 300 blue blue
## 4: 1 400 blue NA
## 5: 1 500 NA NA
## 6: 1 600 NA NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 NA NA
## 10: 2 200 yellow yellow
## 11: 2 300 yellow NA
## 12: 2 400 NA NA
## 13: 2 500 NA NA
## 14: 2 600 red red
## 15: 2 700 red NA
## 16: 2 800 NA NA
Since t increases in increments of 100 for each id, the colour is rolled forward one row. So with these datatables if roll=99
then nothing will be rolled forward and the result is the same as roll=FALSE
.
dt.colour[dt, , roll=99]
## id t colour non.rolled.colour
## 1: 1 100 NA NA
## 2: 1 200 NA NA
## 3: 1 300 blue blue
## 4: 1 400 NA NA
## 5: 1 500 NA NA
## 6: 1 600 NA NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 NA NA
## 10: 2 200 yellow yellow
## 11: 2 300 NA NA
## 12: 2 400 NA NA
## 13: 2 500 NA NA
## 14: 2 600 red red
## 15: 2 700 NA NA
## 16: 2 800 NA NA
What happens if roll isn’t an integer? It seems to work as you would expect.
dt.colour[dt, , roll=100.1]
## id t colour non.rolled.colour
## 1: 1 100 NA NA
## 2: 1 200 NA NA
## 3: 1 300 blue blue
## 4: 1 400 blue NA
## 5: 1 500 NA NA
## 6: 1 600 NA NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 NA NA
## 10: 2 200 yellow yellow
## 11: 2 300 yellow NA
## 12: 2 400 NA NA
## 13: 2 500 NA NA
## 14: 2 600 red red
## 15: 2 700 red NA
## 16: 2 800 NA NA
Here’s an example with a higher value of roll
.
dt.colour[dt, , roll=300]
## id t colour non.rolled.colour
## 1: 1 100 NA NA
## 2: 1 200 NA NA
## 3: 1 300 blue blue
## 4: 1 400 blue NA
## 5: 1 500 blue NA
## 6: 1 600 blue NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 NA NA
## 10: 2 200 yellow yellow
## 11: 2 300 yellow NA
## 12: 2 400 yellow NA
## 13: 2 500 yellow NA
## 14: 2 600 red red
## 15: 2 700 red NA
## 16: 2 800 red NA
Note again that when you specify a number for the roll
argument it is how far a particular value can be rolled forward or back. It is worth pointing out that it can also be thought of as how far back a row missing a value is willing to look to find an alternative. This means you can use it to, say, find the most recent value within a certain window.
Selective Roll
I couldn’t figure out how to roll on some columns but not others. I don’t think this is currently possible.
Nearest Value
You can also set roll=nearest
. Let’s see what this does.
dt.colour[dt, , roll='nearest']
## id t colour non.rolled.colour
## 1: 1 100 blue NA
## 2: 1 200 blue NA
## 3: 1 300 blue blue
## 4: 1 400 blue NA
## 5: 1 500 blue NA
## 6: 1 600 blue NA
## 7: 1 700 blue NA
## 8: 1 800 blue NA
## 9: 2 100 yellow NA
## 10: 2 200 yellow yellow
## 11: 2 300 yellow NA
## 12: 2 400 yellow NA
## 13: 2 500 red NA
## 14: 2 600 red red
## 15: 2 700 red NA
## 16: 2 800 red NA
It looks like the nearest value, either forwards or backwards, is taken. You can see that there is a tie when id=2
and t=400
and in this case the value found by looking backwards seems to be taken over the one found by looking forwards. But what if you want to look only so far backwards and forwards?
Rollends
There is also the rollends
argument which by default is set to c(TRUE, FALSE)
if roll
is negative and c(FALSE, TRUE)
if it is positive or zero (I suppose roll=TRUE
is considered positive). So the first element of rollends
is whether to roll backwards and the second element is whether to roll forwards. So by setting rollends=c(TRUE, TRUE)
you can roll backwards and forwards whilst also specifying how far to look.
dt.colour[dt, , roll=200, rollends=c(TRUE,TRUE)]
## id t colour non.rolled.colour
## 1: 1 100 blue NA
## 2: 1 200 blue NA
## 3: 1 300 blue blue
## 4: 1 400 blue NA
## 5: 1 500 blue NA
## 6: 1 600 NA NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 yellow NA
## 10: 2 200 yellow yellow
## 11: 2 300 yellow NA
## 12: 2 400 yellow NA
## 13: 2 500 NA NA
## 14: 2 600 red red
## 15: 2 700 red NA
## 16: 2 800 red NA
What happens when you set rollends=c(FALSE, FALSE)
. I expected this would mean no rolling, but that doesn’t seem to be the case. The documentation says
If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column.
First note that rollends=FALSE
is the same as rollends=c(FALSE, FALSE)
because the value is recycled. Next, in our case the x
referred to is dt.colours
and i
is dt
. So in order for a row in dt
to find a match, its value of t
must fall in between the values of t
for two rows in dt.colours
that have the same value of id
, the first key column. Let’s try it with roll=Inf
. It might be a good idea to print out dt.colours
as well for reference.
dt.colour
## id t colour
## 1: 1 300 blue
## 2: 2 200 yellow
## 3: 2 600 red
# Note roll=Inf
dt.colour[dt, , roll=Inf, rollends=c(FALSE, FALSE)]
## id t colour non.rolled.colour
## 1: 1 100 NA NA
## 2: 1 200 NA NA
## 3: 1 300 blue blue
## 4: 1 400 NA NA
## 5: 1 500 NA NA
## 6: 1 600 NA NA
## 7: 1 700 NA NA
## 8: 1 800 NA NA
## 9: 2 100 NA NA
## 10: 2 200 yellow yellow
## 11: 2 300 yellow NA
## 12: 2 400 yellow NA
## 13: 2 500 yellow NA
## 14: 2 600 red red
## 15: 2 700 NA NA
## 16: 2 800 NA NA
Only rows of dt
that have a value of t
in between values of t
in dt.colour can roll previous values
forward. So when id
is 1 there is no rolling since there is only one applicable row in dt.colours
meaning nothing can be in between anything. However for dt
equal 2 the rows with values of t
from 200-500 are all in between the two entries with an id
of 2 in dt.colour
. So it seems that rollends=FALSE
is a way of turning off rolling on the edges of the data.
Multiple Grouping Columns
Everything used in this post has used two key columns id
and t
when performing rolling joins. However you can use as many as you like with the understanding that the last key column will always be used for rolling and the rest for defining groups. As a finish here’s an example of this.
dt <- data.table(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
id2=c(rep('a', 4), rep('b', 4), rep('a', 4), rep('b', 4)),
t=c(100, 200, 300, 400, 500, 600, 700, 800, 100, 200, 300, 400, 500, 600, 700, 800),
key=c('id', 'id2', 't'))
dt
## id id2 t
## 1: 1 a 100
## 2: 1 a 200
## 3: 1 a 300
## 4: 1 a 400
## 5: 1 b 500
## 6: 1 b 600
## 7: 1 b 700
## 8: 1 b 800
## 9: 2 a 100
## 10: 2 a 200
## 11: 2 a 300
## 12: 2 a 400
## 13: 2 b 500
## 14: 2 b 600
## 15: 2 b 700
## 16: 2 b 800
dt.colour <- data.table(id=c(1, 2, 2),
id2=c('b', 'b', 'b'),
t=c(300, 200, 600),
colour=c('blue', 'yellow', 'red'), key=c('id', 'id2', 't'))
dt.colour
## id id2 t colour
## 1: 1 b 300 blue
## 2: 2 b 200 yellow
## 3: 2 b 600 red
# Add non rolled data for reference
dt[, non.rolled.colour := dt.colour[dt, colour, roll=FALSE]]
dt.colour[dt, , roll=TRUE]
## id id2 t colour non.rolled.colour
## 1: 1 a 100 NA NA
## 2: 1 a 200 NA NA
## 3: 1 a 300 NA NA
## 4: 1 a 400 NA NA
## 5: 1 b 500 blue NA
## 6: 1 b 600 blue NA
## 7: 1 b 700 blue NA
## 8: 1 b 800 blue NA
## 9: 2 a 100 NA NA
## 10: 2 a 200 NA NA
## 11: 2 a 300 NA NA
## 12: 2 a 400 NA NA
## 13: 2 b 500 yellow NA
## 14: 2 b 600 red red
## 15: 2 b 700 red NA
## 16: 2 b 800 red NA
Corrections
2017-07-02: Updated the Rollends section to state that the default value of rollends
when roll
is positive or zero is c(FALSE, TRUE)
rather than c(TRUE, FALSE)
.