I have a .csv file that has some values formatted as paragraphs with line breaks or sometimes bullet points.
"STAT","ID","DESC"
"UPD", "1", "Updated"
"CHG", "2", "Changed"
"UPD", "3", "Updated.
Might have to update again"
"UPD", "4", "Updated.
- once
- twice
- thrice
"
"DEL", "5", "unknown"
"DEL", "6", "Deleted
Need to restore"
I need to count the number of records. I used awk as shown below, since I know the 2nd column is a unique ID, but it returns more records than I have. The text above is of course dummy data, as I'm not allowed to share the original, but I tried to reflect it as closely as I can.
awk '{print $2}' FS="," sample.csv | wc -l
I even printed the first column using awk '{print $1}' to check just the first-column values, but the output also shows the starting parts of the continuation lines inside the paragraphs.
Please let me know if any additional info is needed and I'll update the question.
A way to do this with awk is
awk -v RS=$'"\n"' 'END {print NR}' sample.csv
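RS=$'"\n"' sets the Record Separator (which is newline by default) to the three-character string ", newline, ". The $'...' quoting is a shell feature, so this syntax might work only in bash (and shells with the same quoting, such as ksh and zsh); a multi-character RS is also an extension that POSIX awk does not guarantee, although gawk and mawk support it. This will cause your file to be decomposed into the following records: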
1: "STAT","ID","DESC
2: UPD", "1", "Updated
3: CHG", "2", "Changed
4: UPD", "3", "Updated.
Might have to update again
5: UPD", "4", "Updated.
- once
- twice
- thrice
6: DEL", "5", "unknown
7: DEL", "6", "Deleted
Need to restore"
This assumes that there are no trailing blanks in the file.
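The split described above can be checked against the sample from the question. The following is only a sketch: the helper name count_with_rs and the temporary file are made up for illustration, and it assumes an awk that accepts a multi-character RS. It passes the separator through -v instead of $'...', relying on awk itself expanding \n in a -v assignment, so it should not depend on bash.

```shell
# Rebuild the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"STAT","ID","DESC"
"UPD", "1", "Updated"
"CHG", "2", "Changed"
"UPD", "3", "Updated.
Might have to update again"
"UPD", "4", "Updated.
- once
- twice
- thrice
"
"DEL", "5", "unknown"
"DEL", "6", "Deleted
Need to restore"
EOF

# Hypothetical helper: count records split on quote-newline-quote.
# awk expands \n inside a -v assignment, so no $'...' quoting is needed.
# Assumption: the awk in use accepts a multi-character RS (gawk and mawk do).
count_with_rs() {
    awk -v 'RS="\n"' 'END { print NR }' "$1"
}

count_with_rs "$sample"
```

Note that the count includes the header row, so the number of data rows is one less than what is printed.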
'END {print NR}' reads the file until the end and then prints the record number – in other words, the number of records.
Text files are generally considered to consist of a sequence of lines, delimited by newline characters or character sequences. And, generally, a “record” in a text file is considered to be one line. But awk lets you specify a record separator other than newline. Since the quote-newline-quote string appears between each pair of consecutive records in your file, specifying it as your record separator breaks the file into (very nearly) the records that you want.
But the record separator is like the wall between two rooms – it isn’t part of either one. In normal awk processing, you see records that are lines without newline characters – they are removed. Similarly, in my answer, the quote-newline-quote sequences are removed. But, since there isn’t a record separator before the first record or after the last one, the very first and very last quote characters are not removed.
If you want to process the file one record at a time, this solution might not be good enough, because the first and last records are treated differently from the rest. I agree (somewhat) with Glenn’s recommendation that, for any serious work, you should use a “proper CSV parser”.
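For the narrower task of counting, there is a middle ground that stays in awk and keeps the default newline record separator: track whether a line ends inside a quoted field. This is a different technique from the RS approach above, shown only as a minimal sketch. It assumes that every field is double-quoted and that any literal quote inside a field is doubled (""), so a complete record always contains an even number of quote characters. The function name count_csv_records is hypothetical.

```shell
# Hypothetical helper: count CSV records whose quoted fields may span lines.
# Assumption: a record is complete once the running number of double quotes
# is even (literal quotes inside fields are escaped as "").
count_csv_records() {
    awk '
        { quotes += gsub(/"/, "&") }               # gsub returns how many quotes this line has
        quotes % 2 == 0 { records++; quotes = 0 }  # even total: the record ends on this line
        END { print records + 0 }
    ' "$1"
}

# Rebuild the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"STAT","ID","DESC"
"UPD", "1", "Updated"
"CHG", "2", "Changed"
"UPD", "3", "Updated.
Might have to update again"
"UPD", "4", "Updated.
- once
- twice
- thrice
"
"DEL", "5", "unknown"
"DEL", "6", "Deleted
Need to restore"
EOF

count_csv_records "$sample"   # prints 7: the header plus six data rows
```

Because it only counts quote characters, this does not split records into fields, and a blank line outside any quoted field would be counted as a record. For field-level work, a real CSV parser remains the safer choice.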