I am new R and have a matrix below:
earning
0 Name usd/day update_year usd/day update_year
1 John 52.0 2011 NA NA
2 Mary 44.0 2012 NA NA
3 Nicole 44.5 2000 est. 49.2 2010 est.
4 Cynthia 38.1 2001 est. 44.0 2011
I wanted to clean up the data in R, to only have 3 columns - Name, usd/day and update_year and get something like this:
0 Name usd/day update_year
1 John 52.0 2011
2 Mary 44.0 2012
3 Nicole 49.2 2010
4 Cynthia 44.0 2011
How do I go about doing that in R please?
I don't want to manually combine them and the actual data is more than 100 rows.
This should work. It seems that you want to extract the most recent date (i.e., year) and the highest value for USD. There's a couple of things you need to do.
First, only keep the year in update_year
; it seems that you don't want ' est.' in your final table. We can do that using gsub
.
df$update_year.x <- gsub("[^0-9]", "", df$update_year.x)
df$update_year.y <- gsub("[^0-9]", "", df$update_year.y)
Find the most recent year.
df$update_year <- apply(df[, c(4,6)], 1, max, na.rm=TRUE)
Find the highest USD value.
df$usd.day <- apply(df[, c(3,5)], 1, max, na.rm=TRUE)
Keep relevant columns.
df[, c("Name", "usd.day", "update_year")]
# Name usd.day update_year
#1 John 52.0 2011
#2 Mary 44.0 2012
#3 Nicole 49.2 2010
#4 Cynthia 44.0 2011
Data
df <- read.table(text="
X0 Name usd/day.x update_year.x usd/day.y update_year.y
1 John 52.0 2011 NA NA
2 Mary 44.0 2012 NA NA
3 Nicole 44.5 '2000 est.' 49.2 '2010 est.'
4 Cynthia 38.1 '2001 est.' 44.0 2011", header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
As pointed out in the comments to your answer; there are duplicate column names and that's a problem. I solved that here by adding x/y at the end of the names.
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments