I have a 32M-line file with the following format

token^Iname^Iurl$

where ^I is the tab escape sequence, and $ is the end-of-line.

I need to get the url corresponding to not more than 10k matches of a word against the field name. What I've done is
# Get second column
cut -f2 <myFile> |
# Find the word and line number
grep -nwi "<matchWord>" |
# Get just the number
cut -f1 -d ':' |
# Not more than 10k
head -n10000
And then, for each line number from the previous output:

# Print the line at <number>
sed -n '<number>{p;q}' <myFile> |
# Get 3rd field
cut -f3
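For comparison, the match-and-extract could in principle be folded into a single pass. A sketch with awk (a hypothetical alternative, not what I actually ran) that matches the word only in the second field, case-insensitively and as a whole word, and prints the third:

```shell
# Single-pass sketch (hypothetical alternative): match the word only in
# field 2 and print field 3, quitting after 10000 hits.
printf 'tok1\tlegal yellow pad\t0/3/1.jpg\ntok2\tcoffee mug\t0/3/2.jpg\n' > sample.tsv
awk -F'\t' 'tolower($2) ~ /(^| )yellow( |$)/ { print $3; if (++n == 10000) exit }' sample.tsv
# prints 0/3/1.jpg
```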
Now, this last operation with sed is ridiculously slow. I am wondering how to do all of this with grep only, or any other way that doesn't slow down after the first 1k matches. It would be perfect to be able to make grep operate on the whole line (without cut -f2), targeting only the second column, and then apply cut -f3, but I don't have a clue how to do it.
Example line:

qwertyuiop^Ibananas are yellow^Ihttp://mignons.cool$

Matching the word yellow in the field name should give me http://mignons.cool.
cut is needed because I don't want to match stuff in the fields token and url. But if I send grep a cut of myFile, then I no longer have access to the url field, which is what I am interested in.
Input file:
mxp4EdOy-IXkuwsuOfs0EQ^Ilegal yellow pad paper^I0/3/3031.jpg$
AeS7tgmlVffBhousr9YY5Q^Ihelicopter parking only sign^I0/3/3032.jpg$
8dl-VixSjG4Y0FpX9f5KHA^Iwritten list ^I0/3/3033.jpg$
XYvKZC3D_JSwlY8SPl-zLQ^Ihelicopter parking only road sign^I0/3/3034.jpg$
xF6zpvpHcmfpHP2MmT2FVg^Irun menu windows programming^I0/3/3035.jpg$
mCJvV2rXOmItLBkMZlyIwQ^Icoffee mug^I0/3/3040.jpg$
ZiobHk_dLsN-Q921KPJUTA^Icarpet^I0/3/3197.jpg$
xFrbGOMfVMl0WeqVAcT27A^Iwater jugs^I0/3/3199.jpg$
where ^I is the tab escape sequence, and $ is the end-of-line.
Match the word helicopter.
Expected output (not more than 10k lines):
0/3/3032.jpg
0/3/3034.jpg
Since the url field is mostly numeric and will not contain the match word, I could do

cut -f 2,3 <myFile> | grep <matchWord> | cut -f2 | head -n10000

But it would be nicer to grep the second field only...
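Run against a few lines of the sample input above (with real tabs standing in for ^I), this workaround does produce the expected output:

```shell
# The workaround pipeline applied to three of the sample lines.
printf 'mxp4EdOy-IXkuwsuOfs0EQ\tlegal yellow pad paper\t0/3/3031.jpg\n' >  infile
printf 'AeS7tgmlVffBhousr9YY5Q\thelicopter parking only sign\t0/3/3032.jpg\n' >> infile
printf 'XYvKZC3D_JSwlY8SPl-zLQ\thelicopter parking only road sign\t0/3/3034.jpg\n' >> infile
cut -f 2,3 infile | grep helicopter | cut -f2 | head -n10000
# prints:
# 0/3/3032.jpg
# 0/3/3034.jpg
```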
You probably should not try to cut cut out of the pipeline. In fact, trying to consolidate a pipeline into a single process for the handling of 32M input lines will very likely hurt your task's overall completion time, though this depends on the kind of computer on which you run the job.

If the machine on which you process your data has multiple processor cores, then, generally speaking, consolidating a task loop into a single process means consolidating the whole job onto a single processor core. That may be desirable on a single-core system, or when overall CPU time is precious, but in my experience it is better to saturate the processor and use all cores concurrently to complete the task sooner.
That said, you definitely can grep just the second field:

grep -E $'\t(.* )?yellow( .*)?\t' <infile

...that pattern will match only strings which occur between two tab characters on a line, and only those which are bounded on both sides by either a space or one of the field-delimiting tabs. With GNU grep you can also add the -m (max-count) switch to limit output to no more than 10K matches. And so...
grep -m10000 -E $'\t(.* )?yellow( .*)?\t' <infile | cut -f3
...would be enough to do the whole job.
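Checked against the question's sample data (with the match word swapped to helicopter), the pipeline yields the expected urls. Note that $'...' is bash/ksh/zsh syntax for expanding \t into a real tab:

```shell
# The answer's pipeline on a few of the sample lines.
printf 'AeS7tgmlVffBhousr9YY5Q\thelicopter parking only sign\t0/3/3032.jpg\n'      >  infile2
printf 'xF6zpvpHcmfpHP2MmT2FVg\trun menu windows programming\t0/3/3035.jpg\n'      >> infile2
printf 'XYvKZC3D_JSwlY8SPl-zLQ\thelicopter parking only road sign\t0/3/3034.jpg\n' >> infile2
grep -m10000 -E $'\t(.* )?helicopter( .*)?\t' <infile2 | cut -f3
# prints:
# 0/3/3032.jpg
# 0/3/3034.jpg
```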