Fast field extraction with grep

Atcold

The problem

I have a 32M-line file with the following format:

token^Iname^Iurl$

where ^I is the tab escape sequence, and $ is the end-of-line.

I need to get the url for at most 10k lines whose name field matches a given word. What I've done is:

# Get second column
cut -f2 <myFile> |
# Find the word and line number
grep -nwi "<matchWord>" |
# Get just the number
cut -f1 -d ':' |
# Not more than 10k
head -n10000

And then, for each entry of the previous output

# Print the line at <number> and quit
sed -n '<number>{p;q}' <myFile> |
# Get 3rd field
cut -f3
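
Spelled out as one loop (a sketch; <matchWord> and <myFile> are placeholders as above):

cut -f2 <myFile> | grep -nwi "<matchWord>" | cut -f1 -d ':' | head -n10000 |
while IFS= read -r number; do
    # each iteration re-reads <myFile> from the first line
    sed -n "${number}{p;q}" <myFile> | cut -f3
done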

Now, this last operation with sed is ridiculously slow. I am wondering how to get all of this using grep only, or any other way that doesn't slow down after the first 1k matches.
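
For reference, the whole job can also be done in a single pass with awk, which avoids re-reading the file once per match. This is only a sketch: matchWord and myFile stand in for the real values, "word" here means space-delimited within the name field, and the match word is assumed to contain no regex metacharacters.

awk -F '\t' -v w='matchWord' '
    # case-insensitive, space-delimited word match on the name field only
    tolower($2) ~ ("(^| )" tolower(w) "( |$)") {
        print $3                # the url field
        if (++n == 10000) exit  # not more than 10k matches
    }
' <myFile>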

Idea

It would be just perfect to be able to run grep on the whole line (without cut -f2) while targeting only the second column, and then cut -f3, but I don't have a clue how to do it.
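
In pipeline form, the wish is something like the following, where the pattern would somehow be confined to the second field (hypothetical shape only):

grep '<pattern that matches in field 2 only>' <myFile> | cut -f3 | head -n10000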

Example

Line xyz

qwertyuiop^Ibananas are yellow^Ihttp://mignons.cool$

Matching the word yellow in the name field should give me http://mignons.cool.

cut is needed because I don't want to match anything in the token and url fields.

If I send grep a cut of myFile, then I no longer have access to the url field, which is what I'm interested in.
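
For instance, if <myFile> held just the xyz line above, the url is gone before grep ever sees the data:

cut -f2 <myFile> | grep -nwi 'yellow'
1:bananas are yellow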

Input and expected output

Input file:

mxp4EdOy-IXkuwsuOfs0EQ^Ilegal yellow pad paper^I0/3/3031.jpg$
AeS7tgmlVffBhousr9YY5Q^Ihelicopter parking only sign^I0/3/3032.jpg$
8dl-VixSjG4Y0FpX9f5KHA^Iwritten list ^I0/3/3033.jpg$
XYvKZC3D_JSwlY8SPl-zLQ^Ihelicopter parking only road sign^I0/3/3034.jpg$
xF6zpvpHcmfpHP2MmT2FVg^Irun menu windows programming^I0/3/3035.jpg$
mCJvV2rXOmItLBkMZlyIwQ^Icoffee mug^I0/3/3040.jpg$
ZiobHk_dLsN-Q921KPJUTA^Icarpet^I0/3/3197.jpg$
xFrbGOMfVMl0WeqVAcT27A^Iwater jugs^I0/3/3199.jpg$

where ^I is the tab escape sequence, and $ is the end-of-line.

Match word helicopter.

Expected output (not more than 10k lines):

0/3/3032.jpg
0/3/3034.jpg

Potential solution

Since the url field is just a numeric path that the match word can't hit, I could

cut -f 2,3 <myFile> | grep <matchWord> | cut -f2 | head -n10000

But it would be nicer to grep the second field only...
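
On the sample input above, with helicopter as the match word, this produces exactly the expected output:

cut -f 2,3 <myFile> | grep helicopter | cut -f2 | head -n10000
0/3/3032.jpg
0/3/3034.jpg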

mikeserv

You probably should not try to cut cut out. In fact, trying to consolidate a pipeline into a single process for handling 32M input lines will very likely increase your task's overall completion time. This depends, though, on the kind of computer you run the job on.

If the machine on which you process your data has multiple processor cores, then, generally speaking, consolidating a task loop to a single process means consolidating the whole job to a single processor core. This may be desirable on systems with only a single processor core, or else if overall CPU time is precious, but, in my experience, it is better to saturate a processor and use all cores concurrently to complete the task sooner.

That said, you definitely can grep just the second field:

grep -E $'\t(.* )?yellow( .*)?\t' <infile

...that pattern will match only strings which occur between two tab characters on a line, and only those which are bounded on both sides by either a space or one of the field-delimiting tabs. With GNU grep you can also add the -m (--max-count) switch to limit output to no more than 10k matches. And so...

grep -m10000 -E $'\t(.* )?yellow( .*)?\t' <infile | cut -f3

...would be enough to do the whole job.
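
For example, against the sample input from the question, with helicopter in place of yellow:

grep -m10000 -E $'\t(.* )?helicopter( .*)?\t' <infile | cut -f3
0/3/3032.jpg
0/3/3034.jpg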
