R: Case-insensitive matching of a combination of first and last names (i.e. two columns) across two dataframes

rvrvrv Published at Dev

rvrvrv

In R, I should like to extract the people who completed both versions of a test I designed and subsequently administered in two phases (I asked participants for their first and last names).

The problem is that (1) people aren't consistent in their use of capitals, and (2) some people might share a first name or last name with other people. I therefore need (1) a case-insensitive search, and (2) a new data frame that lists the first and last names from the first version next to the first and last names from the second version, so that I can verify each match (also because someone might use "Tom" in one instance and "Thomas" in another):

df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))

df2 <- data.frame(firstName = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))

The names above should all produce candidate matches for me to verify:

repeatDF <- data.frame(firstName.1 = c("John", "Josef", "Tom", "Huckleberry", "Huckleberry", "Johann", "Johann"),
                       lastName.1 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Bach"),
                       firstName.2 = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                       lastName.2 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))

Of these, I would then (probably manually?) approve all but "Johann Pachelbel" and "Pap Finn": they match name-wise, but they aren't the same person as the one they're matched to.

So far I have tried merge (see also "match two data.frames based on multiple columns") and %in%, but both methods are case-sensitive and miss some matches. I can't get an apply function to work with grep (I must admit I'm not very fluent with either of those functions), and I also don't know how to take both first and last name into account with grep. Am I looking in the right direction, or should I use a different function altogether?
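
For reference, here is a minimal sketch of what a case-insensitive merge on both columns could look like, using a shortened version of the data. The names v1, v2, first.key and last.key are hypothetical helpers, not part of the original data. It illustrates the limitation described above: lowercasing fixes "Josef" vs "josef", but an exact join still cannot pair "Tom" with "Thomas".

```r
# Shortened illustrative data; v1 and v2 are hypothetical names.
v1 <- data.frame(firstName = c("John", "Josef", "Tom"),
                 lastName  = c("Doe", "K", "Sawyer"),
                 stringsAsFactors = FALSE)
v2 <- data.frame(firstName = c("John", "josef", "Thomas"),
                 lastName  = c("Doe", "K", "Sawyer"),
                 stringsAsFactors = FALSE)

# Lowercased copies serve as join keys; the original spelling is kept for review.
v1$first.key <- tolower(v1$firstName); v1$last.key <- tolower(v1$lastName)
v2$first.key <- tolower(v2$firstName); v2$last.key <- tolower(v2$lastName)

# Exact match on both keys; the non-key columns get the suffixes .1 and .2.
exact <- merge(v1, v2, by = c("first.key", "last.key"), suffixes = c(".1", ".2"))
exact[, c("firstName.1", "lastName.1", "firstName.2", "lastName.2")]
```

Only the John Doe and Josef K pairs survive this join, so some form of partial matching is still needed for the remaining cases.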

Any help would be much appreciated!

PS. There seem to be many, many similar questions, but either for different programmes or not requiring both of my considerations – apologies though if there is indeed already an answer to my question!

jlhoward

This seems to work based on OP's comments and new dataset. I changed df2 slightly so the names are not in the same order in both data frames.

df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))

df2 <- data.frame(firstName = c("John", "josef", "Huck", "Pap", "Johann Sebastian", "Johann", "Thomas"),
                  lastName = c("Doe", "K", "Finn", "Finn", "Bach", "Pachelbel", "Sawyer"))

# Compare one row of df1 (A) with one row of df2 (B), ignoring case.
# Returns TRUE if either first name contains the other, or either last name contains the other.
get.match <- function(A, B) {
  A <- as.list(tolower(A)); B <- as.list(tolower(B))
  match.last  <- grepl(A$lastName, B$lastName) | grepl(B$lastName, A$lastName)
  match.first <- grepl(A$firstName, B$firstName) | grepl(B$firstName, A$firstName)
  match.first | match.last
}

indx    <- apply(df2,1,function(row) apply(df1,1,get.match,row))
indx
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
# [1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# [4,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

m.1     <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result  <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
result
#       firstName lastName        firstName  lastName
# 1          John      Doe             John       Doe
# 2         Josef        K            josef         K
# 3           Tom   Sawyer           Thomas    Sawyer
# 4   Huckleberry     Finn             Huck      Finn
# 4.1 Huckleberry     Finn              Pap      Finn
# 5        Johann     Bach Johann Sebastian      Bach
# 5.1      Johann     Bach           Johann Pachelbel

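So this uses an algorithm implemented in get.match(...), which compares a row of df1 to a row of df2 after converting both to lower case. It returns TRUE if the first name in either row occurs within the first name of the other row, or the last name in either row occurs within the last name of the other row. Because the two tests are combined with "or", a shared last name alone is enough for a match, which is why "Pap Finn" and "Johann Pachelbel" show up as candidates. The line: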

indx    <- apply(df2,1,function(row) apply(df1,1,get.match,row))

then creates a logical matrix, indx, in which the rows correspond to rows of df1 and the columns to rows of df2; an element is TRUE if the corresponding rows of df1 and df2 match. This allows for the possibility of more than one match in either df1 or df2. Finally we convert this indx matrix to the result you want using:

m.1     <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result  <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))

This code repeats each row of df1 once for every match it has in df2 (rows of df1 without a match are dropped), and then binds those rows to the corresponding rows from df2.
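
As a variation on the same idea, the sketch below builds the match matrix with outer() and reads the matching pairs off with which(..., arr.ind = TRUE). It is only a sketch under a few assumptions: the helper name contains.either and the output name candidates are made up for illustration, the columns are plain character vectors, and fixed = TRUE is used so that characters such as "." in a name are treated literally and not as regular-expression syntax. The output columns carry the .1 and .2 suffixes asked for in the question.

```r
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName  = c("Doe", "K", "Sawyer", "Finn", "Bach"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(firstName = c("John", "josef", "Huck", "Pap", "Johann Sebastian", "Johann", "Thomas"),
                  lastName  = c("Doe", "K", "Finn", "Finn", "Bach", "Pachelbel", "Sawyer"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: element-wise TRUE where either lowercased string
# contains the other as a literal substring.
contains.either <- function(x, y) {
  x <- tolower(x); y <- tolower(y)
  mapply(function(a, b) grepl(a, b, fixed = TRUE) || grepl(b, a, fixed = TRUE),
         x, y, USE.NAMES = FALSE)
}

# Rows are rows of df1, columns are rows of df2 (same layout as indx above).
indx <- outer(df1$firstName, df2$firstName, contains.either) |
        outer(df1$lastName,  df2$lastName,  contains.either)

# One (row, col) pair per TRUE cell, ordered by df1 row and then df2 row.
pairs <- which(indx, arr.ind = TRUE)
pairs <- pairs[order(pairs[, "row"], pairs[, "col"]), , drop = FALSE]

candidates <- data.frame(firstName.1 = df1$firstName[pairs[, "row"]],
                         lastName.1  = df1$lastName[pairs[, "row"]],
                         firstName.2 = df2$firstName[pairs[, "col"]],
                         lastName.2  = df2$lastName[pairs[, "col"]],
                         stringsAsFactors = FALSE)
candidates
```

For this data the candidate pairs are the same seven as in the result above, including "Pap Finn" and "Johann Pachelbel", which still have to be rejected by hand.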

Edited at 2021-02-13
