R: Case-insensitive matching of a combination of first and last names (i.e. two columns) across two dataframes

rvrvrv

In R, I should like to extract the people who completed both versions of a test I designed and subsequently administered in two phases (I asked participants for their first and last names).

The problem is that 1. people aren't consistent in using capitals; and 2. some people might share a first name or last name with other people. Thus, 1. I need a case-insensitive search; and 2. I should like to extract a new data frame that lists the first and last names of the first version, and the first and last names of the second version, in order to verify the match (also because someone might use "Tom" in one instance and "Thomas" in another):

df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))

df2 <- data.frame(firstName = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))

The above names should all provide a match for me to verify:

repeatDF <- data.frame(firstName.1 = c("John", "Josef", "Tom", "Huckleberry", "Huckleberry", "Johann", "Johann"),
                       lastName.1 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Bach"),
                       firstName.2 = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                       lastName.2 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))

Of which I then (probably manually?) approve all but "Johann Pachelbel" and "Pap Finn", as they might match name-wise, but aren't the same person as the one they're matched to.

So far I have tried merge (see also match two data.frames based on multiple columns) and %in%, but both methods are case-sensitive and miss out on some matches. I somehow can't get an apply function to work using grep (must admit: not very fluent with either of those functions), but also don't know how to take into account both first and last name using grep? Am I looking in the right direction, or should I use an altogether different function?

Any help would be much appreciated!

PS. There seem to be many, many similar questions, but either for different programmes or not requiring both of my considerations – apologies though if there is indeed already an answer to my question!

jlhoward

This seems to work based on OP's comments and new dataset. I changed df2 slightly so the names are not in the same order in both data frames.

df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                 lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))

df2 <- data.frame(firstName = c("John", "josef", "Huck", "Pap", "Johann Sebastian", "Johann", "Thomas"),
                  lastName = c("Doe", "K", "Finn", "Finn", "Bach", "Pachelbel", "Sawyer"))
get.match <- function(A,B) {
  A <- as.list(tolower(A)); B <- as.list(tolower(B))
  match.last  <- grepl(A$lastName,B$lastName)|grepl(B$lastName,A$lastName)
  match.first <- grepl(A$firstName,B$firstName)|grepl(B$firstName,A$firstName)
  match.first | match.last
} 

indx    <- apply(df2,1,function(row) apply(df1,1,get.match,row))
indx
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
# [1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# [4,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

m.1     <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result  <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
result
#       firstName lastName        firstName  lastName
# 1          John      Doe             John       Doe
# 2         Josef        K            josef         K
# 3           Tom   Sawyer           Thomas    Sawyer
# 4   Huckleberry     Finn             Huck      Finn
# 4.1 Huckleberry     Finn              Pap      Finn
# 5        Johann     Bach Johann Sebastian      Bach
# 5.1      Johann     Bach           Johann Pachelbel

So this uses an algorithm implemented in get.match(...) which compares a row of df1 to a row of df2 and returns TRUE if the first name in either row is present in the first name of the other row or the last name in either row is present in the last name of the other row. The line:

indx    <- apply(df2,1,function(row) apply(df1,1,get.match,row))

then creates an indx matrix where the rows represent rows in df1 and the columns represent rows of df2 and the element is TRUE if the corresponding rows of df1 and df2 match. This allows for the possibility of more than one match in either df1 or df2. Finally we convert this indx matrix to the result you want using:

m.1     <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result  <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))

This code extracts all the rows of df1 which have matches in df2, and then binds that to the corresponding rows from df2.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

filling in columns with matching IDs from two dataframes in R

From Dev

Combining first two columns and turn it into row names in R data.frame

From Dev

Can foreign key reference to a table with composite(i.e combination of two columns) key?

From Dev

Capturing matching and non-matching values between two dataframes in R

From Dev

Multiplying two different dataframes only for completely matching row and column names

From Dev

Generate several multiplots using two dataframes with the same column names in R

From Dev

Join two dataframes and overwrite matching rows [R]

From Dev

Sum product by row across two dataframes/matrix in r

From Dev

Applying a function to every combination of two columns in a dataframe using R

From Dev

Unique constraint on combination of two columns?

From Dev

Multiply two Pandas dataframes with same shape and same columns names

From Dev

Matching two dataframes in scala

From Dev

Merge DataFrames on two columns

From Dev

Can foreign key reference to a table with composite(i.e combination of two columns) key?

From Dev

Matching First and Last Name on two different tables

From Dev

Multiplying two different dataframes only for completely matching row and column names

From Dev

VBA remove matching first & last names across 2 worksheets

From Dev

First two rows per combination of two columns

From Dev

Applying a function to every combination of two columns in a dataframe using R

From Dev

Merging two dataframes with unequal numbers of rows by two columns names

From Dev

adding columns to the appropriate rows by matching the first two columns

From Dev

Merge DataFrames on two columns

From Dev

Awk matching values of first two columns and printing in blank field

From Dev

Getting distinct combination of two columns

From Dev

Quick way of matching data between two dataframes [R]

From Dev

Case insensitive matching for pandas dataframe columns

From Dev

Partial matching of elements in two string columns in R

From Dev

How do I find the difference between two values in different dataframes across multiple rows and columns?

From Dev

Give names to each two columns in R

Related Related

  1. 1

    filling in columns with matching IDs from two dataframes in R

  2. 2

    Combining first two columns and turn it into row names in R data.frame

  3. 3

    Can foreign key reference to a table with composite(i.e combination of two columns) key?

  4. 4

    Capturing matching and non-matching values between two dataframes in R

  5. 5

    Multiplying two different dataframes only for completely matching row and column names

  6. 6

    Generate several multiplots using two dataframes with the same column names in R

  7. 7

    Join two dataframes and overwrite matching rows [R]

  8. 8

    Sum product by row across two dataframes/matrix in r

  9. 9

    Applying a function to every combination of two columns in a dataframe using R

  10. 10

    Unique constraint on combination of two columns?

  11. 11

    Multiply two Pandas dataframes with same shape and same columns names

  12. 12

    Matching two dataframes in scala

  13. 13

    Merge DataFrames on two columns

  14. 14

    Can foreign key reference to a table with composite(i.e combination of two columns) key?

  15. 15

    Matching First and Last Name on two different tables

  16. 16

    Multiplying two different dataframes only for completely matching row and column names

  17. 17

    VBA remove matching first & last names across 2 worksheets

  18. 18

    First two rows per combination of two columns

  19. 19

    Applying a function to every combination of two columns in a dataframe using R

  20. 20

    Merging two dataframes with unequal numbers of rows by two columns names

  21. 21

    adding columns to the appropriate rows by matching the first two columns

  22. 22

    Merge DataFrames on two columns

  23. 23

    Awk matching values of first two columns and printing in blank field

  24. 24

    Getting distinct combination of two columns

  25. 25

    Quick way of matching data between two dataframes [R]

  26. 26

    Case insensitive matching for pandas dataframe columns

  27. 27

    Partial matching of elements in two string columns in R

  28. 28

    How do I find the difference between two values in different dataframes across multiple rows and columns?

  29. 29

    Give names to each two columns in R

HotTag

Archive