I am looking at data on butterflies that have been caught in different samples. My problem is that the 'names' (numbers) used for the same species are inconsistent. Each species has been assigned a number to identify it.
I have two data frames. The first, "mydata", is a dataset of counts of each species, but some species in it have been assigned multiple IDs instead of a single correct one. Two different numbers may therefore refer to the same species, and I need to make sure my names are standardised.
IDs <- c(10,8,3,42,7,23,42,2)
sample1 <- c(0,0,2,0,3,0,0,2)
sample2 <- c(0,1,0,2,4,0,3,1)
sample3 <- c(0,1,1,0,2,0,3,1)
sample4 <- c(0,2,0,2,0,1,2,1)
sample5 <- c(3,1,0,0,1,0,0,1)
mydata <- cbind(IDs,sample1,sample2,sample3,sample4,sample5)
I have a second database that I am using as a reference, "specieslist", and this contains the correct ID, plus all alternative IDs that may have been used.
ID1 <- c(10,34,20,2,7,38)
ID2 <- c(22,3,42,NA,6,23)
ID3 <- c(NA,8,NA,NA,1,NA)
correct.ID <- c(10,3,20,2,1,23)
specieslist <- cbind(ID1,ID2,ID3,correct.ID)
splist <- replace(specieslist,is.na(specieslist),0)
I want to search specieslist to find out which number should be used in mydata, and assign the correct ID to a new column in mydata.
I have been trying to create a loop that finds which row of specieslist contains the value in mydata, and then selects the value in the correct.ID column for that row.
corr.sp <- c(NULL)
rws <- length(mydata[,1])
for(s in 1:rws){
dat <- as.character(mydata[s,1])
pos <- which(splist==dat, arr.ind=TRUE)
ind <- pos[1,1]
corr <- as.matrix(splist[ind,4])
corr.sp <- c(corr.sp,corr)
}
mydata.corrsps <- cbind(mydata,corr.sp)
What I expect is for corr.sp and mydata.corrsps to look like this:
corr.sp <- c(10,3,3,20,1,23,20,2)
mydata.corrsps <- cbind(mydata,corr.sp)
This demo code seems to work, but with some of my real data an error appears when I run the loop, saying that my row index (pos[1,1]) is out of bounds. I've had this error before when the loop searched for species that weren't found in that dataset, but I have been through and removed any rows where this applies, saved the file as a csv and reimported it to avoid row-index mix-ups (which seem to happen when subsetting data in R). I have also checked that the maximum value of pos[1,1] does not exceed the number of rows available for selection, and that all values it searches for are present in the data.
I would be very grateful if anyone could suggest a better way of doing what I am unsuccessfully trying to do, or point out where I am going wrong.
You could convert splist to long format, and then merge the relevant columns with mydata:
library(tidyr)
library(dplyr)
# splist to long format
long.splist <- data.frame(splist) %>% gather(key, IDs, ID1:ID3)
# merge
merge(mydata,long.splist[,c(3,1)])
#  IDs sample1 sample2 sample3 sample4 sample5 correct.ID
#1   2       2       1       1       1       1          2
#2   3       2       0       1       0       0          3
#3   7       3       4       2       0       1          1
#4   8       0       1       1       2       1          3
#5  10       0       0       0       0       3         10
#6  23       0       0       0       1       0         23
#7  42       0       2       0       2       0         20
#8  42       0       3       3       2       0         20
The result is ordered by IDs, as that's the column on which the join was performed.
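If the original row order matters, or if some IDs might be missing from the reference, a base R lookup with match() is one possible alternative. In the loop from the question, which(..., arr.ind=TRUE) returns a matrix with zero rows when an ID is not found anywhere in splist, so pos[1,1] fails with "subscript out of bounds"; match() returns NA in that case instead. The following is only a sketch: it assumes each ID appears in a single row of specieslist, it uses a cut-down mydata with one sample column, and the names alt.cols, lookup.alt, lookup.correct and unmatched are introduced here for illustration.

```r
# Data from the question (only one sample column, to keep the sketch short)
IDs <- c(10, 8, 3, 42, 7, 23, 42, 2)
sample1 <- c(0, 0, 2, 0, 3, 0, 0, 2)
mydata <- cbind(IDs, sample1)

ID1 <- c(10, 34, 20, 2, 7, 38)
ID2 <- c(22, 3, 42, NA, 6, 23)
ID3 <- c(NA, 8, NA, NA, 1, NA)
correct.ID <- c(10, 3, 20, 2, 1, 23)
specieslist <- cbind(ID1, ID2, ID3, correct.ID)

# Columns to search. correct.ID is included so that an ID that is already
# correct is still found even if it is not listed among the alternatives.
alt.cols <- c("ID1", "ID2", "ID3", "correct.ID")

# Flatten the searched columns into one vector (column by column) and
# repeat the correct IDs so both vectors line up element by element.
lookup.alt <- as.vector(specieslist[, alt.cols])
lookup.correct <- rep(specieslist[, "correct.ID"], times = length(alt.cols))

# match() gives the position of the first hit, or NA when there is none,
# so an unknown ID produces NA rather than an out-of-bounds error.
corr.sp <- lookup.correct[match(mydata[, "IDs"], lookup.alt)]
mydata.corrsps <- cbind(mydata, corr.sp)

# IDs in mydata that have no entry anywhere in specieslist
unmatched <- unique(mydata[is.na(corr.sp), "IDs"])
```

Here mydata.corrsps keeps the rows in their original order, and a non-empty unmatched vector would list the IDs that cause the loop to fail on real data.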