I have a data table with 3 columns: pid ( a job) , starttime (time when it starts ) and fintime (time when it ends) as made up below:
require(data.table)
dt <- data.table(pid=sample(1:100,100), starttime = sample(1:100,100)/100)[,fintime:=starttime + round(runif(100)/4,2)]
I need to identify all possible TWO jobs that can be done sequentially but confirming to an acceptable “gap” between the Jobs. I can do this using SQL for a gap between 0.05 and 0.4 units(of time) as below:
require(sqldf)
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as startime, b.fintime as fintime
from dt a, dt b
where a.fintime < b.starttime - 0.05
and a.fintime > b.starttime - 0.4
")
How can I do this using data.table ? (I am hoping for a performance improvement over sqldf when the data is large and with more constraints)
So here is a data.table approach that is about 20X faster, but there are some caveats (described at the end).
require(data.table)
set.seed(1) # for reproducible example
n <- 100 # simple example
dt <- data.table(pid=sample(1:n,n),
starttime = sample(1:n,n)/n,2)[,fintime:=starttime + round(runif(n)/4,2)]
# sqldf approach
require(sqldf)
f.sql <- function(dt) {
sqldf("create index idx on dt(starttime,fintime)")
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as starttime, b.fintime as fintime
from dt a, dt b
where b.starttime >= a.fintime + 0.05
and b.starttime <= a.fintime + 0.4
")
}
res.sql <- f.sql(dt)
# data.table approach with foverlaps(...): need >= 1.9.4 for this!!
packageVersion("data.table")
# [1] ‘1.9.4’
f.DT <- function(dt) {
lookup <- dt[,list(second=pid, fintime, a=starttime,b=starttime)]
setkey(lookup,a,b)
DT <- dt[,list(first=pid, starttime, a=fintime+0.05,b=fintime+0.4)]
J.olaps <- foverlaps(DT,lookup,type="any",nomatch=0)
J.olaps[,list(first,second,starttime,fintime)]
}
res.DT <- f.DT(dt)
So this uses the foverlaps(...)
function in the newest version of data.table
(1.9.4). Suppose you have two data.tables, x
and y
. Each has a pair of columns that form a range. foverlaps(...)
finds all combinations of records in x
and y
where there is overlap between the range in x
and the range in y
. Here we set this up so that x
has range defined by fintime+0.04
and fintime+0.5
, and y
has range defined by starttime
at both ends. So now foverlaps(...)
looks for any combination of records where the starttime is between 0.04 and 0.5 more than the fintime.
Now for the caveats:
First, this only works (AFAIK) if you are willing to relax your constraints to a closed interval (e.g., b.starttime >= a.fintime + 0.05
, vs. strictly >
).
Second, the data.table approach finds all the records found in the sql approach plus some additional records. You can see this with the following code:
indx <- data.table(first=res.sql$first,second=res.sql$second,key=c("first","second"))
setkey(res.DT,first,second)
extra <- res.DT[!indx,]
The extra records seem like they are legitimate, so the question is: why are they not found by sqldf(...)
? I can't answer that.
Third, this works for your example, but might not be easy to extend with "more constraints".
Finally, here is a "benchmark" with a dataset more similar to your actual data:
set.seed(1)
n <- 1e4 # more realistic example
dt <- data.table(pid=sample(1:n,n),
starttime = sample(1:n,n)/n)[,fintime:=starttime + round(runif(n)/4,2)]
system.time(res.sql <- f.sql(dt))
# user system elapsed
# 45.25 0.53 45.80
system.time(res.DT <- f.DT(dt))
# user system elapsed
# 2.09 0.86 2.94
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments