left outer join in R with conditions

MikeTexnik

Is there a way to merge (left outer join) data frames on multiple columns, but with an OR condition between the columns?

Example: There are two data frames, df1 and df2, with columns x, y and num. I would like a data frame with all rows from df1, but with only those rows from df2 that satisfy the condition: df1$x == df2$x OR df1$y == df2$y.

Here is some sample data:

df1 <- data.frame(x = LETTERS[1:5],
                  y = 1:5,
                  num = rnorm(5), stringsAsFactors = FALSE)
df1
  x y       num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559

df2 <- data.frame(x = LETTERS[3:7],
                  y = 3:7,
                  num = rnorm(5), stringsAsFactors = FALSE)
df2$x[4] <- NA
df2$y[3] <- NA
df2
     x  y        num
1    C  3 -0.7160824
2    D  4 -0.3283618
3    E NA -1.8775298
4 <NA>  6 -0.9821082
5    G  7  1.8726288

Then, the result is expected to be:

  x y       num    x  y        num
1 A 1 0.4209480 <NA> NA         NA
2 B 2 0.4687401 <NA> NA         NA
3 C 3 0.3018787    C  3 -0.7160824
4 D 4 0.0669793    D  4 -0.3283618
5 E 5 0.9231559    E NA -1.8775298

The most obvious solution is to use the sqldf package:

mergedData <- sqldf::sqldf("SELECT * FROM df1
                           LEFT OUTER JOIN df2
                           ON df1.x = df2.x
                           OR df1.y = df2.y")

Unfortunately, this simple solution is extremely slow: merging data frames with more than 100k rows each takes far too long to be practical.

Another option is to split the right data frame and merge by parts, but is there a more elegant or even "out of the box" solution?

Arun

Here's one approach using data.table. For each column, we perform a join but extract only the matching row indices (as opposed to materialising the entire join). Then we combine the indices from all the columns (this part would need some changes if there can be multiple matches).

library(data.table)
setDT(df1) # convert both data frames to data.tables by reference
setDT(df2)

foo <- function(dx, dy, cols) {
    ix = lapply(cols, function(col) {
        dy[dx, on=col, which=TRUE] # for each row in dx, get matching indices of dy
                                   # by matching on column specified in "col"
    })
    # per row, keep the non-NA index across columns (the larger one if they differ)
    do.call(function(...) pmax(..., na.rm=TRUE), ix)
}
ix = foo(df1, df2, c("x", "y")) # obtain matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]] # update df1 by reference
df1
#    x y         num col1 col2       col3
# 1: A 1  2.09611034   NA   NA         NA
# 2: B 2 -1.06795571   NA   NA         NA
# 3: C 3  1.38254433    C    3  1.0173476
# 4: D 4 -0.09367922    D    4 -0.6379496
# 5: E 5  0.47552072    E   NA -0.1962038
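
If a row of df1 can match more than one row of df2, combining the indices with pmax keeps only one of them. One possible way to keep every match is sketched below in base R. It is only an illustration under assumptions: the data frames left and right are fixed stand-ins for df1 and df2 (not the random sample above), and the helper pairs_on and the lid/rid/right_ names are made up for this example.

```r
# Hypothetical fixed data with the same shape as df1 and df2.
left  <- data.frame(x = c("A", "B", "C", "D", "E"), y = 1:5,
                    stringsAsFactors = FALSE)
right <- data.frame(x = c("C", "D", "E", NA, "G"), y = c(3L, 4L, NA, 6L, 7L),
                    num = c(10, 20, 30, 40, 50), stringsAsFactors = FALSE)

# Row ids let us describe matches as (left row, right row) pairs.
left$lid  <- seq_len(nrow(left))
right$rid <- seq_len(nrow(right))

# Collect all pairs that agree on one key column. NA keys are dropped first,
# because merge() would otherwise treat NA as equal to NA.
pairs_on <- function(col) {
  l <- left[!is.na(left[[col]]), c("lid", col)]
  r <- right[!is.na(right[[col]]), c("rid", col)]
  merge(l, r, by = col)[, c("lid", "rid")]
}

# OR condition = union of the per-column pairs, without duplicates.
pairs <- unique(rbind(pairs_on("x"), pairs_on("y")))

# Left join: every left row is kept; rid is NA where nothing matched.
res <- merge(left, pairs, by = "lid", all.x = TRUE)
res <- res[order(res$lid, res$rid), ]

# Attach the matched right-hand columns (all-NA rows where rid is NA).
matched <- right[res$rid, c("x", "y", "num")]
names(matched) <- paste0("right_", names(matched))
res <- cbind(res[, c("x", "y")], matched)
rownames(res) <- NULL
res
```

A left row that matches several right rows appears once per match in res, which is what the SQL query with OR would return. Whether this is fast enough on 100k+ rows has not been measured here.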

You can use setDF(df1) to convert it back to a data.frame, if necessary.
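
For comparison, the same "look up an index per column, then combine" idea can probably be written without data.table by using match(). This is a rough sketch rather than a drop-in replacement: left and right are hypothetical fixed stand-ins for df1 and df2, match() only ever returns the first matching row, and when the two columns point at different rows this version prefers the x match (the pmax version above would take the larger index instead).

```r
# Hypothetical fixed data with the same shape as df1 and df2.
left  <- data.frame(x = c("A", "B", "C", "D", "E"), y = 1:5,
                    stringsAsFactors = FALSE)
right <- data.frame(x = c("C", "D", "E", NA, "G"), y = c(3L, 4L, NA, 6L, 7L),
                    num = c(10, 20, 30, 40, 50), stringsAsFactors = FALSE)

# Index of the first matching row in `right` for each row of `left`, per column.
# incomparables = NA prevents an NA key from matching another NA key.
ix_x <- match(left$x, right$x, incomparables = NA)
ix_y <- match(left$y, right$y, incomparables = NA)

# OR condition: use the x match when there is one, otherwise fall back to y.
ix <- ifelse(is.na(ix_x), ix_y, ix_x)

# Indexing with NA yields an all-NA row, which gives left-join behaviour.
matched <- right[ix, ]
names(matched) <- paste0("right_", names(matched))
res <- cbind(left, matched)
rownames(res) <- NULL
res
```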


edited at 2021-03-3
