Can a SQL non-equi join task (example below) be done (faster and /or neater) using data.table?

6th

I have a data table with 3 columns: pid ( a job) , starttime (time when it starts ) and fintime (time when it ends) as made up below:

require(data.table)

dt <- data.table(pid=sample(1:100,100), starttime = sample(1:100,100)/100)[,fintime:=starttime + round(runif(100)/4,2)] 

I need to identify all possible TWO jobs that can be done sequentially but confirming to an acceptable “gap” between the Jobs. I can do this using SQL for a gap between 0.05 and 0.4 units(of time) as below:

require(sqldf)
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as startime,  b.fintime as fintime 
    from  dt a, dt b 
    where  a.fintime < b.starttime - 0.05
       and a.fintime > b.starttime - 0.4
  ")

How can I do this using data.table ? (I am hoping for a performance improvement over sqldf when the data is large and with more constraints)

jlhoward

So here is a data.table approach that is about 20X faster, but there are some caveats (described at the end).

require(data.table)
set.seed(1)  # for reproducible example
n <- 100     # simple example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n,2)[,fintime:=starttime + round(runif(n)/4,2)]
# sqldf approach
require(sqldf)
f.sql <- function(dt) {
  sqldf("create index idx on dt(starttime,fintime)")
  res <- sqldf("select a.pid as first, b.pid as second , a.starttime as starttime,  b.fintime as fintime 
    from  dt a, dt b 
    where b.starttime >= a.fintime + 0.05
      and b.starttime <= a.fintime + 0.4
  ")  
}
res.sql <- f.sql(dt)

# data.table approach with foverlaps(...): need >= 1.9.4 for this!!
packageVersion("data.table")
# [1] ‘1.9.4’
f.DT <- function(dt) {
  lookup <- dt[,list(second=pid, fintime, a=starttime,b=starttime)]
  setkey(lookup,a,b)
  DT      <- dt[,list(first=pid, starttime, a=fintime+0.05,b=fintime+0.4)]
  J.olaps <- foverlaps(DT,lookup,type="any",nomatch=0) 
  J.olaps[,list(first,second,starttime,fintime)]
}
res.DT <- f.DT(dt)

So this uses the foverlaps(...) function in the newest version of data.table (1.9.4). Suppose you have two data.tables, x and y. Each has a pair of columns that form a range. foverlaps(...) finds all combinations of records in x and y where there is overlap between the range in x and the range in y. Here we set this up so that x has range defined by fintime+0.04 and fintime+0.5, and y has range defined by starttime at both ends. So now foverlaps(...) looks for any combination of records where the starttime is between 0.04 and 0.5 more than the fintime.

Now for the caveats:

First, this only works (AFAIK) if you are willing to relax your constraints to a closed interval (e.g., b.starttime >= a.fintime + 0.05, vs. strictly >).

Second, the data.table approach finds all the records found in the sql approach plus some additional records. You can see this with the following code:

indx  <- data.table(first=res.sql$first,second=res.sql$second,key=c("first","second"))
setkey(res.DT,first,second)
extra <-  res.DT[!indx,]

The extra records seem like they are legitimate, so the question is: why are they not found by sqldf(...)? I can't answer that.

Third, this works for your example, but might not be easy to extend with "more constraints".

Finally, here is a "benchmark" with a dataset more similar to your actual data:

set.seed(1)
n <- 1e4   # more realistic example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n)[,fintime:=starttime + round(runif(n)/4,2)]
system.time(res.sql <- f.sql(dt))
#    user  system elapsed 
#   45.25    0.53   45.80 
system.time(res.DT  <- f.DT(dt))
#    user  system elapsed 
#    2.09    0.86    2.94 

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

Equi-join and rolling join in data.table

From Dev

What is a valid use case for using a Non-Equi Join? >, >=, <, <=, <>

From Dev

left non equi join linq

From Java

how can I make a detailed data.table neater?

From Dev

Can I make this dplyr + data.table task faster?

From Dev

how would i get it to show me records in one table that arent present in another table using an EQUI JOIN?

From Dev

Why is intersect(...) faster than data table join?

From Dev

Can this be done faster with numpy?

From Dev

Example Criteria Hibernate Multiple Join/Table SQL

From Dev

Updating a table using SQL function (data dependent on source) in the INNER JOIN

From Dev

SQL How to Delete Data from Table Using INNER JOIN

From Dev

Solving Equi task in python

From Dev

Querying the data from sql table using non key column

From Dev

Are there faster alternatives to the CROSS JOIN model used below?

From Dev

Can someone give me a real time example with the below temp table and tablevariable example that I found in stackexchange

From Dev

BeautifulSoup takes forever, can this be done faster?

From Dev

Can C# execution of Python script be done faster by re-using process?

From Dev

Can C# execution of Python script be done faster by re-using process?

From Dev

How to join three table to check the data in sql?

From Dev

SQL Join with nested table returns wrong data

From Dev

SQL combine and join row data with same Table

From Dev

How to perform equi join in mongodb?

From Dev

Can I join a table to a list using linq?

From Dev

Using SQL to join tables with a junction table

From Java

Update a table using JOIN in SQL Server?

From Dev

Output in SQL Table : without using JOIN

From Dev

Create Table using inner join in sql

From Dev

Create Table using inner join in sql

From Dev

SQL calculate percentage using table self join

Related Related

  1. 1

    Equi-join and rolling join in data.table

  2. 2

    What is a valid use case for using a Non-Equi Join? >, >=, <, <=, <>

  3. 3

    left non equi join linq

  4. 4

    how can I make a detailed data.table neater?

  5. 5

    Can I make this dplyr + data.table task faster?

  6. 6

    how would i get it to show me records in one table that arent present in another table using an EQUI JOIN?

  7. 7

    Why is intersect(...) faster than data table join?

  8. 8

    Can this be done faster with numpy?

  9. 9

    Example Criteria Hibernate Multiple Join/Table SQL

  10. 10

    Updating a table using SQL function (data dependent on source) in the INNER JOIN

  11. 11

    SQL How to Delete Data from Table Using INNER JOIN

  12. 12

    Solving Equi task in python

  13. 13

    Querying the data from sql table using non key column

  14. 14

    Are there faster alternatives to the CROSS JOIN model used below?

  15. 15

    Can someone give me a real time example with the below temp table and tablevariable example that I found in stackexchange

  16. 16

    BeautifulSoup takes forever, can this be done faster?

  17. 17

    Can C# execution of Python script be done faster by re-using process?

  18. 18

    Can C# execution of Python script be done faster by re-using process?

  19. 19

    How to join three table to check the data in sql?

  20. 20

    SQL Join with nested table returns wrong data

  21. 21

    SQL combine and join row data with same Table

  22. 22

    How to perform equi join in mongodb?

  23. 23

    Can I join a table to a list using linq?

  24. 24

    Using SQL to join tables with a junction table

  25. 25

    Update a table using JOIN in SQL Server?

  26. 26

    Output in SQL Table : without using JOIN

  27. 27

    Create Table using inner join in sql

  28. 28

    Create Table using inner join in sql

  29. 29

    SQL calculate percentage using table self join

HotTag

Archive