Can a SQL non-equi join task (example below) be done (faster and /or neater) using data.table?

6th Published at Dev

6th

I have a data table with 3 columns: pid ( a job) , starttime (time when it starts ) and fintime (time when it ends) as made up below:

require(data.table)

dt <- data.table(pid=sample(1:100,100), starttime = sample(1:100,100)/100)[,fintime:=starttime + round(runif(100)/4,2)]

I need to identify all possible TWO jobs that can be done sequentially but confirming to an acceptable “gap” between the Jobs. I can do this using SQL for a gap between 0.05 and 0.4 units(of time) as below:

require(sqldf)
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as startime,  b.fintime as fintime 
    from  dt a, dt b 
    where  a.fintime < b.starttime - 0.05
       and a.fintime > b.starttime - 0.4
  ")

How can I do this using data.table ? (I am hoping for a performance improvement over sqldf when the data is large and with more constraints)

jlhoward

So here is a data.table approach that is about 20X faster, but there are some caveats (described at the end).

require(data.table)
set.seed(1)  # for reproducible example
n <- 100     # simple example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n,2)[,fintime:=starttime + round(runif(n)/4,2)]
# sqldf approach
require(sqldf)
f.sql <- function(dt) {
  sqldf("create index idx on dt(starttime,fintime)")
  res <- sqldf("select a.pid as first, b.pid as second , a.starttime as starttime,  b.fintime as fintime 
    from  dt a, dt b 
    where b.starttime >= a.fintime + 0.05
      and b.starttime <= a.fintime + 0.4
  ")  
}
res.sql <- f.sql(dt)

# data.table approach with foverlaps(...): need >= 1.9.4 for this!!
packageVersion("data.table")
# [1] ‘1.9.4’
f.DT <- function(dt) {
  lookup <- dt[,list(second=pid, fintime, a=starttime,b=starttime)]
  setkey(lookup,a,b)
  DT      <- dt[,list(first=pid, starttime, a=fintime+0.05,b=fintime+0.4)]
  J.olaps <- foverlaps(DT,lookup,type="any",nomatch=0) 
  J.olaps[,list(first,second,starttime,fintime)]
}
res.DT <- f.DT(dt)

So this uses the foverlaps(...) function in the newest version of data.table (1.9.4). Suppose you have two data.tables, x and y. Each has a pair of columns that form a range. foverlaps(...) finds all combinations of records in x and y where there is overlap between the range in x and the range in y. Here we set this up so that x has range defined by fintime+0.04 and fintime+0.5, and y has range defined by starttime at both ends. So now foverlaps(...) looks for any combination of records where the starttime is between 0.04 and 0.5 more than the fintime.

Now for the caveats:

First, this only works (AFAIK) if you are willing to relax your constraints to a closed interval (e.g., b.starttime >= a.fintime + 0.05, vs. strictly >).

Second, the data.table approach finds all the records found in the sql approach plus some additional records. You can see this with the following code:

indx  <- data.table(first=res.sql$first,second=res.sql$second,key=c("first","second"))
setkey(res.DT,first,second)
extra <-  res.DT[!indx,]

The extra records seem like they are legitimate, so the question is: why are they not found by sqldf(...)? I can't answer that.

Third, this works for your example, but might not be easy to extend with "more constraints".

Finally, here is a "benchmark" with a dataset more similar to your actual data:

set.seed(1)
n <- 1e4   # more realistic example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n)[,fintime:=starttime + round(runif(n)/4,2)]
system.time(res.sql <- f.sql(dt))
#    user  system elapsed 
#   45.25    0.53   45.80 
system.time(res.DT  <- f.DT(dt))
#    user  system elapsed 
#    2.09    0.86    2.94

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-02-13

Comments

0 comments

From Dev

Can someone give me a real time example with the below temp table and tablevariable example that I found in stackexchange

From Dev

Related Related

Article