Unable to get accurate record count of a csv file

NightOwl Skeptic

I have a .csv file that has some values formatted as paragraphs with line breaks or sometimes bullet points.

"STAT","ID","DESC"
"UPD", "1", "Updated"
"CHG", "2", "Changed"
"UPD", "3", "Updated.
Might have to update again"
"UPD", "4", "Updated.

 - once 
 - twice
 - thrice
"
"DEL", "5", "unknown"
"DEL", "6", "Deleted
Need to restore"

I need to count the number of records. Since I know the 2nd column is a unique ID, I used awk as shown below, but it returns more records than I actually have. The text above is of course dummy data, as I'm not allowed to share the original, but I tried to reflect it as closely as I can.

 awk  '{print $2}' FS=","  sample.csv | wc -l 

I even printed the first column using awk '{print $1}' to check just its values, but the output also shows the beginnings of the continuation lines inside the paragraphs.

Please let me know if any additional info is needed and I'll update the question.

G-Man Says 'Reinstate Monica'

A way to do this with awk is

awk -v RS=$'"\n"' 'END {print NR}' sample.csv
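  • RS=$'"\n"' sets the Record Separator (which is newline by default) to the three-character string ", newline, ".  The $'...' quoting is a shell feature that bash supports (as do ksh93 and zsh), so it might not work in a plain POSIX sh; a multi-character RS is itself an extension that gawk and mawk provide.  This will cause your file to be decomposed into the following records: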

    1: "STAT","ID","DESC

    2: UPD", "1", "Updated

    3: CHG", "2", "Changed

    4: UPD", "3", "Updated.
        Might have to update again

    5: UPD", "4", "Updated.
       
        - once
        - twice
        - thrice
       

    6: DEL", "5", "unknown

    7: DEL", "6", "Deleted
        Need to restore"

    This assumes that no line in the file has trailing blanks after its closing quote.

  • 'END {print NR}' reads the file until the end and then prints the record number; in other words, the number of records.  For the sample this is 7, which includes the header line.
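The two points above can be tried out with a small self-contained sketch. It recreates the question's sample in a temporary file and writes the separator as -v RS='"\n"', relying on awk itself to expand \n in a -v assignment, so it should not depend on the shell's $'...' quoting. The temporary file and the count_records helper are illustrative names, not part of the original answer, and the sketch assumes an awk that accepts a multi-character RS (gawk or mawk, for example).

```shell
#!/bin/sh
# Recreate the sample from the question in a temporary file (illustrative).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
"STAT","ID","DESC"
"UPD", "1", "Updated"
"CHG", "2", "Changed"
"UPD", "3", "Updated.
Might have to update again"
"UPD", "4", "Updated.

 - once
 - twice
 - thrice
"
"DEL", "5", "unknown"
"DEL", "6", "Deleted
Need to restore"
EOF

# Hypothetical helper: count records separated by quote-newline-quote.
# awk expands \n inside a -v assignment, so no $'...' quoting is needed.
count_records() {
    awk -v RS='"\n"' 'END { print NR }' "$1"
}

total=$(count_records "$sample")
echo "records including header: $total"
echo "data records: $((total - 1))"
```

Subtracting one for the header gives the number of data records, which is presumably the figure the question is after.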

Text files are generally considered to consist of a sequence of lines, delimited by newline characters or character sequences.  And, generally, a “record” in a text file is considered to be one line.  But awk lets you specify a record separator other than newline.  Since the quote-newline-quote string appears between each pair of consecutive records in your file, specifying it as your record separator breaks the file into (very nearly) the records that you want.

But the record separator is like the wall between two rooms – it isn’t part of either one.  In normal awk processing, you see records that are lines without newline characters – they are removed.  Similarly, in my answer, the quote-newline-quote sequences are removed.  But, since there isn’t a record separator before the first record or after the last one, the very first and very last quote characters are not removed.
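To see the leftover quotes described above, and one possible way to trim them, here is a minimal sketch on a three-record excerpt of the sample. The show_records helper and the temporary file are hypothetical names introduced for illustration, and the same assumption about a multi-character RS (gawk, mawk) applies.

```shell
#!/bin/sh
# A three-record excerpt of the question's sample (illustrative temp file).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    '"STAT","ID","DESC"' \
    '"UPD", "1", "Updated"' \
    '"DEL", "6", "Deleted' \
    'Need to restore"' > "$sample"

# Hypothetical helper: print each record in brackets after trimming the
# quote that precedes the first record and the quote plus final newline
# that follow the last one, so that all records look alike.
show_records() {
    awk -v RS='"\n"' '
        NR == 1 { sub(/^"/, "") }   # opening quote of the very first record
        { sub(/"\n$/, "") }         # closing quote and newline at end of file
        { printf "%d: [%s]\n", NR, $0 }
    ' "$1"
}

show_records "$sample"
```

The second substitution is meant to affect only the last record: any earlier quote-newline pair that is followed by another quote has already been consumed as a separator.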

If you want to process the file one record at a time, this solution might not be good enough, because the first and last records are treated differently from the rest.  I agree (somewhat) with Glenn’s recommendation that, for any serious work, you should use a “proper CSV parser”.
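Short of a real CSV parser, a different technique from the one in this answer is to keep the default newline separator and treat a line as the end of a record only when the number of double quotes seen so far in that record is even. The sketch below is not a parser; it assumes that every quote inside a field is doubled (the usual CSV convention), and the count_csv_records helper and sample data are illustrative. It uses only standard awk features and does not care about trailing blanks.

```shell
#!/bin/sh
# Illustrative sample: note the trailing blanks after one closing quote.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    '"STAT","ID","DESC"' \
    '"UPD", "3", "Updated.' \
    'Might have to update again"  ' \
    '"UPD", "4", "Updated.' \
    '' \
    ' - once' \
    '"' \
    '"DEL", "5", "unknown"' > "$sample"

# Hypothetical helper: count records by tracking quote parity.
count_csv_records() {
    awk '
        # Skip blank lines that fall between records.
        quotes == 0 && /^[ \t]*$/ { next }
        # gsub returns how many quotes it found on this line.
        { quotes += gsub(/"/, "&") }
        # An even total means every quoted field has been closed.
        quotes % 2 == 0 { records++; quotes = 0 }
        END { print records + 0 }
    ' "$1"
}

count_csv_records "$sample"
```

For this sample the helper should report 4 records, header included.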
