Removing specific columns from multiple data frames (.tab) and then merging them in R

debugcn Published at Dev

Kshitij Mohan

I have 24 ".tab" files in a folder with names file1.tab, file2.tab, ..... file24.tab. Each of the files is a dataframe with 4 columns and 50,000 rows: The file looks like the image attached-

This is how each of the dataframe file looks like.

The first column is same in all the 24 files, but columns 2,3 and 4 have different values in each of the 24 files. For me, the columns 3 and 4 of each dataframe are irrelevant. I can get rid of the columns in each dataframe individually by following steps :

filenames <- Sys.gob("*.tab")  #reads all the 24 file names
dataframe1 <- read.tab(filenames[1]) 
dataframe1 <- dataframe1[, -c(3,4)] #removes 3rd and 4th column of dataframe

However, this becomes very hectic when I have to repeat the above operation individually on 24 (or more) files which are similar. Is there a way to perform the above operation i.e. removing 3rd and 4th columns from all the 24 files by one code ?

Second part:

After removing the 3rd and 4th columns from each of the 24 files, I want to create a new dataframe which has 25 columns, such that the first column is the Column1 (which is same in all the files) and the subsequent columns are column2 from each of the files.

For two dataframes df1 and df2, I use :

merge(df1,df2,1,1)

and it creates a new data frame. It would be extremely tedious to do the merge operation individually for 24 modified dataframes. Could you please help me?

PS - I tried to find answers to any similar question (if asked before) and could not find it. So, in case it is marked as duplicate, it would be very kind if you please put a link to where it has been answered. I have just started learning R and have no prior experience.

Regards, Kshitij

SebastianP

First lets make a list of fake files

fakefile <- 'a\tb\tc\td
1\t2\t3\t4'

# In your case instead oof the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))


## 'data.frame':    1 obs. of  4 variables:
##  $ a: int 1
##  $ b: int 2
##  $ c: int 3
##  $ d: int 4

# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)


##  chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...

In principle, all solutions will have the same underlying work as a list and then merge concept (although the merge might be different here and there).

Solution 1 - If you can rely on the order of column 1

If you can rely on the ordering of the columns, then you dont really need to read columns 1 and 4 of each file, but just col 4 and bind them.

# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]

# Reading cols 4 in all files

# We first make a function that does our tasks (reading and removing cols)

reader_fun <- function(x) {
    read.table(text = x, header = TRUE)[,4] 
}

# Then we use lapply to use that function on each elment of our list

cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)


## List of 20
##  $ : int 4
##  $ : int 4
##  $ : int 4
##  $ : int 4


# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)

# And finally add column 1 to it
data.frame(col1, cols4_mat)

##   col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1    1  4  4  4  4  4  4  4  4  4   4   4   4   4   4   4   4   4   4   4
##   X20
## 1   4

Solution 2 - If you can not rely in the order

The implementation is easier but it is a lot slower in most situations

# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)

# But since im passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)

# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)


##   a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1   4   4     4     4       4       4         4         4           4
##   d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1           4             4             4               4               4
##   d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1                 4                 4                   4
##   d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1                   4                     4                     4


# you will need to modify the column names btw ...

Bonus - And the most concise solution ...

Depending on how big your data sets are, you might want to ignore the extra columns from the start (instead of reading them and then removing them). You can use fread from the data.table package to do that for you.

reader_function <- function(x) {
    data.table::fread(x, select = c(1,4))
}

my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-08-12

Comments

0 comments

From Dev

Related Related

Article

Removing specific columns from multiple data frames (.tab) and then merging them in R

Removing specific columns from multiple data frames (.tab) and then merging them in R

R - subtracting multiple columns from multiple columns with 2 data frames

Merging/combining specific columns from separate data frames or objects within a list

Extract columns with same names from multiple data frames [R]

Merging Python data frames on multiple columns with different join types

R: Removing multiple columns from data frame based on vector values

merging list of multiple data frames

Removing attributes of columns in data.frames on multilevel lists in R

Add columns to specific rows in a list of data frames from a data frame in R

Add columns to specific rows in a list of data frames from a data frame in R

Merging rows within a data frame and merging columns between data frames

Merging rows within a data frame and merging columns between data frames

Saving multiple data frames and return them from a function

Removing columns from data frame using R

Python Pandas: Merging data frames on multiple conditions

Python: How to combine specific columns from two different data frames

Merging data frames of different row length in R

Merging a lot of different data frames in R

Merging uneven Panel Data frames in R

R merging data frames containing dates and times

lapply() and spline() on two data frames in R , No Merging

Merging list of data frames, each as a factor R

Merging multiple columns to one in data frame by sum r

R: Merging data frames: Exclude specific column value, but keep skipped rows

how to extract columns in R to make multiple data frames?

how to extract columns in R to make multiple data frames?

Outer join 2 data frames by multiple columns in R

How can I convert the format of columns from multiple data frames?

Python Merge columns from multiple data frames into a single dataframe

Perform calculation for multiple groups with columns from seperate data frames