R data.table fread command: how to read large files with irregular separators?

fxi

I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and store the results in a clean SQLite database.

Everything works fine when my script imports the files with read.table(), but it's slow. So I tried fread from the data.table package (version 1.9.2), but it gives me this error:

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 rows and 7 columns of my data look like this:

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

So there is a leading space on every line, then a single space between the date column names (the right-aligned values underneath can be padded with more), then an arbitrary number of spaces between the other columns.

I've tried a command like this to convert the runs of spaces into commas:

DT <- fread(
            # sed replaces each run of whitespace with a comma before fread parses the output
            paste("sed 's/\\s\\+/,/g'", txt),
            header=TRUE,
            select=c('YYYY','MM','DD','HH')
)

without success: the problem remains, and the sed step seems to make it slow.

fread doesn't seem to like an "arbitrary number of spaces" as a separator, or an empty column at the beginning. Any ideas?

Here is a (hopefully) minimal reproducible example (there is a newline character after 40790):

txt <- " YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00"

# header row present; keep only the date columns
testDT <- fread(txt,
                header=TRUE,
                select=c("YYYY","MM","DD","HH")
)

Thanks for your help!

UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the table is read as a single line, which is no better.

UPDATE 2: As mentioned in the comments, I could use sed to format the table and then read it with fread. I've put a script in an answer where I create a sample dataset and then compare some system.time() results.
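A comparison of that kind could look roughly like the sketch below. It is not the script referred to above: the sample size `n`, the file names and the random values are assumptions made for illustration, the separators are cleaned with R's own regex functions instead of sed, and fread is timed only if data.table happens to be installed.

```r
# Build a small sample file in the layout described in the question
# (leading blank, narrow date fields, wide right-aligned value fields).
set.seed(1)
n <- 1000  # assumed sample size, far smaller than the real files
sample_file <- tempfile(fileext = ".txt")
header <- sprintf(" YYYY MM DD HH mm %17s %17s", "19490", "40790")
rows <- sprintf(" %4d %2d %2d %2d %2d %17s %17s",
                1991L, 10L, 1L,
                sample(0:23, n, replace = TRUE),
                sample(0:59, n, replace = TRUE),
                sprintf("%E", runif(n)), sprintf("%E", runif(n)))
writeLines(c(header, rows), sample_file)

# Baseline: read.table copes with the irregular blanks directly.
t_raw <- system.time(
  a <- read.table(sample_file, header = TRUE, check.names = FALSE)
)

# Cleaning step: strip leading blanks, turn each run of blanks into a comma.
clean_file <- tempfile(fileext = ".csv")
t_clean <- system.time({
  lines <- readLines(sample_file)
  lines <- gsub("[[:blank:]]+", ",", sub("^[[:blank:]]*", "", lines))
  writeLines(lines, clean_file)
})

# Reading the cleaned file with base R.
t_csv <- system.time(b <- read.csv(clean_file, check.names = FALSE))
timings <- rbind(read.table_raw = t_raw, clean = t_clean, read.csv_clean = t_csv)

# fread is timed only when data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  t_fread <- system.time(d <- data.table::fread(clean_file, header = TRUE))
  timings <- rbind(timings, fread_clean = t_fread)
}
print(timings[, 1:3])
```

On a sample this small the timings mostly reflect fixed overhead, so `n` would need to be raised considerably before the numbers say anything about 2 GB files.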

NeronLeVelu
sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g' 

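for your sed command: the first expression removes leading blanks, and the second replaces each remaining run of blanks with a single comma.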
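To see what those two expressions do to the sample from the question, the same substitutions can be reproduced with R's own regex functions. This is only an illustration in base R (the variable names `raw`, `clean` and `df` are made up here), not a replacement for running sed on the real 2 GB files.

```r
# The sample from the question: a header row and one data row,
# each starting with a blank and using runs of blanks as separators.
raw <- c(
  " YYYY MM DD HH mm             19490             40790",
  " 1991 10  1  1  0      1.046465E+00      1.568405E+00"
)

# Same two steps as the sed command:
# 1) drop leading blanks, 2) collapse each run of blanks into one comma.
clean <- sub("^[[:blank:]]*", "", raw)
clean <- gsub("[[:blank:]]+", ",", clean)
print(clean)

# Parse the cleaned text; check.names = FALSE keeps the numeric
# column names (19490, 40790) exactly as they appear in the header.
df <- read.csv(text = clean, header = TRUE, check.names = FALSE)
str(df)
```

Because the leading blank is removed before the commas are inserted, the cleaned lines no longer start with an empty first field, which is what the `\s\+` substitution in the question would produce.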

Isn't it possible to gather all the data for fread into one (temporary) file (adding a reference to the source file) and process that file with sed (or another tool), to avoid forking the tool at every iteration?
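A minimal sketch of that idea, written in base R instead of sed so that it is self-contained: `combine_to_csv` is a hypothetical helper, and it assumes that every file has the same header row, which the question does not confirm. It also reads each file fully into memory with readLines, so for files of ~2 GB it would have to be adapted to work in chunks.

```r
# Hypothetical helper: merge several blank-separated files into one CSV,
# tagging every data row with the name of the file it came from.
# Assumes all files share the same header row.
combine_to_csv <- function(paths, out = tempfile(fileext = ".csv")) {
  con <- file(out, open = "w")
  on.exit(close(con))
  for (i in seq_along(paths)) {
    lines <- readLines(paths[i])
    lines <- sub("^[[:blank:]]*", "", lines)    # drop leading blanks
    lines <- gsub("[[:blank:]]+", ",", lines)   # runs of blanks -> comma
    # Prefix each data row with its source file name.
    body <- if (length(lines) > 1) {
      paste0(basename(paths[i]), ",", lines[-1])
    } else {
      character(0)
    }
    # Write the header once, from the first file, with a "source" column.
    if (i == 1) writeLines(paste0("source,", lines[1]), con)
    writeLines(body, con)
  }
  out
}

# Usage on two copies of the sample from the question.
sample_lines <- c(
  " YYYY MM DD HH mm             19490             40790",
  " 1991 10  1  1  0      1.046465E+00      1.568405E+00"
)
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(sample_lines, f1)
writeLines(sample_lines, f2)

combined <- combine_to_csv(c(f1, f2))
all_data <- read.csv(combined, check.names = FALSE)
print(all_data)
```

The combined file is an ordinary comma-separated file, so it could be handed to a single fread(combined) call instead of read.csv.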
