I have 24 ".tab" files in a folder with names file1.tab, file2.tab, ..... file24.tab. Each of the files is a dataframe with 4 columns and 50,000 rows: The file looks like the image attached-
This is how each of the dataframe file looks like.
The first column is same in all the 24 files, but columns 2,3 and 4 have different values in each of the 24 files. For me, the columns 3 and 4 of each dataframe are irrelevant. I can get rid of the columns in each dataframe individually by following steps :
filenames <- Sys.gob("*.tab") #reads all the 24 file names
dataframe1 <- read.tab(filenames[1])
dataframe1 <- dataframe1[, -c(3,4)] #removes 3rd and 4th column of dataframe
However, this becomes very hectic when I have to repeat the above operation individually on 24 (or more) files which are similar. Is there a way to perform the above operation i.e. removing 3rd and 4th columns from all the 24 files by one code ?
Second part:
After removing the 3rd and 4th columns from each of the 24 files, I want to create a new dataframe which has 25 columns, such that the first column is the Column1 (which is same in all the files) and the subsequent columns are column2 from each of the files.
For two dataframes df1 and df2, I use :
merge(df1,df2,1,1)
and it creates a new data frame. It would be extremely tedious to do the merge operation individually for 24 modified dataframes. Could you please help me?
PS - I tried to find answers to any similar question (if asked before) and could not find it. So, in case it is marked as duplicate, it would be very kind if you please put a link to where it has been answered. I have just started learning R and have no prior experience.
Regards, Kshitij
First lets make a list of fake files
fakefile <- 'a\tb\tc\td
1\t2\t3\t4'
# In your case instead oof the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))
## 'data.frame': 1 obs. of 4 variables:
## $ a: int 1
## $ b: int 2
## $ c: int 3
## $ d: int 4
# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)
## chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...
In principle, all solutions will have the same underlying work as a list and then merge concept (although the merge might be different here and there).
Solution 1 - If you can rely on the order of column 1
If you can rely on the ordering of the columns, then you dont really need to read columns 1 and 4 of each file, but just col 4 and bind them.
# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]
# Reading cols 4 in all files
# We first make a function that does our tasks (reading and removing cols)
reader_fun <- function(x) {
read.table(text = x, header = TRUE)[,4]
}
# Then we use lapply to use that function on each elment of our list
cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)
## List of 20
## $ : int 4
## $ : int 4
## $ : int 4
## $ : int 4
# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)
# And finally add column 1 to it
data.frame(col1, cols4_mat)
## col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## X20
## 1 4
Solution 2 - If you can not rely in the order
The implementation is easier but it is a lot slower in most situations
# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)
# But since im passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)
# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
## a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1 4 4 4 4 4 4 4 4 4
## d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1 4 4 4 4 4
## d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1 4 4 4
## d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1 4 4 4
# you will need to modify the column names btw ...
Bonus - And the most concise solution ...
Depending on how big your data sets are, you might want to ignore the extra columns from the start (instead of reading them and then removing them). You can use fread from the data.table package to do that for you.
reader_function <- function(x) {
data.table::fread(x, select = c(1,4))
}
my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments