Splitting out a large file

Michel Hua

I would like to process a 200 GB file with lines like the following:

...
{"captureTime": "1534303617.738","ua": "..."}
...

The objective is to split this file into multiple files, one per hour of captureTime.

Here is my basic script:

#!/bin/sh

echo "Splitting files"

echo "Total lines"
sed -n '$=' "$1"

echo "First Date"
head -n1 "$1" | jq '.captureTime' | xargs -I{} date -d '@{}' '+%Y%m%d%H'

echo "Last Date"
tail -n1 "$1" | jq '.captureTime' | xargs -I{} date -d '@{}' '+%Y%m%d%H'

# Read each line verbatim (no backslash or whitespace mangling)
while IFS= read -r p; do
  date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -I{} date -d '@{}' '+%Y%m%d%H')
  echo "$p" >> "split.$date"
done <"$1"

Some facts:

  • 80 000 000 lines to process
  • jq doesn't work well since some JSON lines are invalid.

Could you help me to optimize this bash script?

Thank you

kvantour

This awk solution might come to your rescue:

awk -F'"' '{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }' "$1"

It essentially replaces your while-loop.
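
To see why $4 holds the timestamp, here is a small check on the sample line from the question. With the double quote as field separator, the fields alternate between the text outside and inside the quotes, so the fourth field is the captureTime value. The variable names are only for illustration.

```shell
# Sample line copied from the question
line='{"captureTime": "1534303617.738","ua": "..."}'

# Split on double quotes and print the fourth field
timestamp=$(printf '%s\n' "$line" | awk -F'"' '{ print $4 }')
echo "$timestamp"   # prints 1534303617.738
```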

Furthermore, you can replace the complete script with:

# Start AWK file
BEGIN { FS = "\"" }
(NR==1){tmin=tmax=$4}
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
  print "Total lines processed: ", NR
  print "First date: "strftime("%Y%m%d%H",tmin)
  print "Last date:  "strftime("%Y%m%d%H",tmax)
}

You can then run it as:

awk -f <awk_file.awk> <jq-file>

Note: strftime is a GNU awk extension and not part of POSIX awk, so run this with gawk.
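
A possible refinement, offered as a sketch rather than a measured improvement: closing the output file after every line means reopening it 80 000 000 times. If the input is mostly in time order, it may be cheaper to close a file only when the hour changes. The sketch below also sets aside lines whose fourth field is not a plain epoch timestamp, since the question mentions invalid JSON lines. The function name split_by_hour and the file name split.invalid are hypothetical, and strftime still assumes GNU awk (or another awk that provides it).

```shell
# Hypothetical helper: split_by_hour FILE
# Writes split.YYYYMMDDHH files (local time) into the current directory.
split_by_hour() {
  awk -F'"' '
    # Lines without a numeric timestamp in field 4 go to a separate file
    $4 !~ /^[0-9]+(\.[0-9]+)?$/ { print > "split.invalid"; next }
    {
      file = "split." strftime("%Y%m%d%H", int($4))
      if (file != prev) {            # hour changed since the previous line
        if (prev != "") close(prev)  # keep only one output file open
        prev = file
      }
      print >> file                  # append, so reopening a file is safe
    }
  ' "$1"
}
```

It would be called as split_by_hour followed by the input file name. Because output is appended, unsorted input still ends up in the right files; it just triggers more close and reopen cycles.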
