In R, I should like to extract the people who completed both versions of a test I designed and subsequently administered in two phases (I asked participants for their first and last names).
The problem is that 1. people aren't consistent in using capitals; and 2. some people might share a first name or last name with other people. Thus, 1. I need a case-insensitive search; and 2. I should like to extract a new data frame that lists the first and last names of the first version, and the first and last names of the second version, in order to verify the match (also because someone might use "Tom" in one instance and "Thomas" in another):
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))
df2 <- data.frame(firstName = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
lastName = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))
The above names should all provide a match for me to verify:
repeatDF <- data.frame(firstName.1 = c("John", "Josef", "Tom", "Huckleberry", "Huckleberry", "Johann", "Johann"),
lastName.1 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Bach"),
firstName.2 = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
lastName.2 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))
Of which I then (probably manually?) approve all but "Johann Pachelbel" and "Pap Finn", as they might match name-wise, but aren't the same person as the one they're matched to.
So far I have tried merge
(see also match two data.frames based on multiple columns) and %in%
, but both methods are case-sensitive and miss out on some matches. I somehow can't get an apply
function to work using grep
(must admit: not very fluent with either of those functions), but also don't know how to take into account both first and last name using grep
? Am I looking in the right direction, or should I use an altogether different function?
Any help would be much appreciated!
PS. There seem to be many, many similar questions, but either for different programmes or not requiring both of my considerations – apologies though if there is indeed already an answer to my question!
This seems to work based on OP's comments and new dataset. I changed df2
slightly so the names are not in the same order in both data frames.
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))
df2 <- data.frame(firstName = c("John", "josef", "Huck", "Pap", "Johann Sebastian", "Johann", "Thomas"),
lastName = c("Doe", "K", "Finn", "Finn", "Bach", "Pachelbel", "Sawyer"))
get.match <- function(A,B) {
A <- as.list(tolower(A)); B <- as.list(tolower(B))
match.last <- grepl(A$lastName,B$lastName)|grepl(B$lastName,A$lastName)
match.first <- grepl(A$firstName,B$firstName)|grepl(B$firstName,A$firstName)
match.first | match.last
}
indx <- apply(df2,1,function(row) apply(df1,1,get.match,row))
indx
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [4,] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE TRUE TRUE FALSE
m.1 <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
result
# firstName lastName firstName lastName
# 1 John Doe John Doe
# 2 Josef K josef K
# 3 Tom Sawyer Thomas Sawyer
# 4 Huckleberry Finn Huck Finn
# 4.1 Huckleberry Finn Pap Finn
# 5 Johann Bach Johann Sebastian Bach
# 5.1 Johann Bach Johann Pachelbel
So this uses an algorithm implemented in get.match(...)
which compares a row of df1
to a row of df2
and returns TRUE
if the first name in either row is present in the first name of the other row or the last name in either row is present in the last name of the other row. The line:
indx <- apply(df2,1,function(row) apply(df1,1,get.match,row))
then creates an indx
matrix where the rows represent rows in df1
and the columns represent rows of df2
and the element is TRUE
if the corresponding rows of df1
and df2
match. This allows for the possibility of more than one match in either df1
or df2
. Finally we convert this indx
matrix to the result
you want using:
m.1 <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
This code extracts all the rows of df1
which have matches in df2
, and then binds that to the corresponding rows from df2
.
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments