I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hour.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' "$1"
echo "First Date"
head -n1 "$1" | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 "$1" | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while IFS= read -r p; do
    date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
    echo "$p" >> "split.$date"
done < "$1"
Some facts: jq doesn't work well, since some of the JSON lines are invalid.
Could you help me optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }' "$1"
It essentially replaces your while loop.
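You can verify the field split on the sample line from your question: with -F'"' every double quote acts as a separator, so the fourth field is the raw epoch timestamp that strftime needs.
echo '{"captureTime": "1534303617.738","ua": "..."}' | awk -F'"' '{ print $4 }'
# prints 1534303617.738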
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN { FS = "\"" }
(NR == 1) { tmin = tmax = $4 }
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }
END {
    print "Total lines processed: ", NR
    print "First date: " strftime("%Y%m%d%H", tmin)
    print "Last date: " strftime("%Y%m%d%H", tmax)
}
You can then run this as:
awk -f <awk_file.awk> <jq-file>
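Since some of your JSON lines are invalid, field 4 may not hold an epoch timestamp on every line, and strftime would then produce garbage file names. A guard rule placed before the main rule in the awk file could divert such lines; this is only a sketch, and the file name split.invalid is my own choice:
# Divert lines whose 4th field is not an epoch timestamp
# ("split.invalid" is an illustrative file name)
$4 !~ /^[0-9]+(\.[0-9]+)?$/ { print >> "split.invalid"; close("split.invalid"); next }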
Note: the usage of strftime indicates that you need to use GNU awk.
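If GNU awk is not available, one workaround is to let awk shell out to date(1) for the conversion. This is only a sketch, and it assumes GNU date (your original script already relies on date -d); be aware that spawning one date process per line will be very slow on a 200 GB file:
awk -F'"' '{
    cmd = "date -d @" $4 " +%Y%m%d%H"   # build the date(1) command
    cmd | getline stamp                  # read the formatted hour bucket
    close(cmd)
    file = "split." stamp
    print >> file
    close(file)
}' "$1"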