Awk comparing 3 values, 2nd file value between 1st file values with multiple column printout between both files to a 3rd

ecog

I am trying to compare two large tab-delimited files. I have been trying awk and bash (Ubuntu 15.10), Python (v3.5) and PowerShell (Windows 10). My only background is Java, but my field tends to stick with scripting languages.

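I am trying to see which positions in File 2 (column 4) fall between the start and end positions in File 1 (columns 4 and 5).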

File 1 A[ ]

1   gramene gene    4854    9652    .   -   .   ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:C0P8I2];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1   gramene gene    9882    10387   .   -   .   ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1   gramene gene    109519  111769  .   -   .   ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1   gramene gene    136307  138929  .   +   .   ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2 B [ ]

S1_6370 T/C 1   6370    +
S1_8210 T   1   8210    +
S1_8376 A   1   8376    +
S1_9889 A   1   9889    +

Output

1   ID=gene:GRMZM2G059865   4854    9652    -   S1_6370 T/C 6370    +
1   ID=gene:GRMZM2G059865   4854    9652    -   S1_8210 T   8210    +
1   ID=gene:GRMZM2G059865   4854    9652    -   S1_8376 A   8376    +
1   ID=gene:GRMZM5G888250   9882    10387   -   S1_9889 A   9889    +

My general logic

loop (until end of A[ ] and B[ ])
if
B[$4]>A[$4] && B[$4]<A[$5]  #if the value in B column 4 is in between the values in A columns 4 & 5.
then
-F"\t" print {A[1], A[9(filtered)], A[$4FS$5], B[$1], B[$2], B[$3], B[$4], B[$5]}   #hopefully reflects awk column calls if the two files were able to have their columns defined that way.
movea++ # to see if the next set of B column 4 values is in between the values in A columns 4 & 5
else
moveb++ #to see if the next set of A columns 4&5 values contain the current values of B column 4 in them.

I know this logic doesn't follow any language that I am aware of, but it is similar in parts. It seems that NR and FNR are two built-in running counters in awk. Awk made it easy to split File 2, which had 10 distinct values in B[$1], into 10 files, and cut removed the few hundred columns (~255+) beyond the 5 you see here. Now I am working with File 2 sizes of around a couple of MB instead of one 1.6 GB file. Besides cutting down loading times, I wanted to simplify the loops. I haven't gone back to my earlier Python or PowerShell attempts since trimming the files down; I had convinced myself that their built-in libraries or cmdlets just weren't going to read my files. I'll try them again soon if I am unable to figure out an awk solution.

comparing multiple files and columns using awk #referenced
Awk greater than less than but within a set range #referenced
efficiently splitting one file into several files by value of column #the one thing that worked
Using awk to get a specific string in line #might be able to filter column 9
How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix? #this seemed the closest but without all the printing out in a third file I wanted, still not able to figure out the syntax completely

John1024

Try:

$ awk 'BEGIN{x=getline s <"B"; split(s,b,"\t")} !x{exit} {sub(/;.*/,"",$9); while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}}' OFS='\t' A
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_6370 T/C     6370    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8210 T       8210    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8376 A       8376    +
1       ID=gene:GRMZM5G888250   9882    10387   -       S1_9889 A       9889    +
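
To try the command on the sample data from the question, the following sketch recreates both files in a temporary directory and runs the same program. The file names A and B match the command above. The temporary directory, the shell variable out, and the shortened column 9 (only the first two attributes are kept) are assumptions made for illustration.

```shell
#!/bin/sh
# Work in a scratch directory (hypothetical location) so nothing is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# File A: the four gene lines, tab-separated; column 9 is shortened here.
printf '1\tgramene\tgene\t4854\t9652\t.\t-\t.\tID=gene:GRMZM2G059865;biotype=protein_coding\n'   >  A
printf '1\tgramene\tgene\t9882\t10387\t.\t-\t.\tID=gene:GRMZM5G888250;biotype=protein_coding\n'  >> A
printf '1\tgramene\tgene\t109519\t111769\t.\t-\t.\tID=gene:GRMZM2G093344;biotype=protein_coding\n' >> A
printf '1\tgramene\tgene\t136307\t138929\t.\t+\t.\tID=gene:GRMZM2G093399;biotype=protein_coding\n' >> A

# File B: the four positions from the question.
printf 'S1_6370\tT/C\t1\t6370\t+\n' >  B
printf 'S1_8210\tT\t1\t8210\t+\n'   >> B
printf 'S1_8376\tA\t1\t8376\t+\n'   >> B
printf 'S1_9889\tA\t1\t9889\t+\n'   >> B

# Same program as above, with its output captured for inspection.
out=$(awk 'BEGIN{x=getline s <"B"; split(s,b,"\t")} !x{exit} {sub(/;.*/,"",$9); while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}}' OFS='\t' A)
printf '%s\n' "$out"
```

Redirecting the last command to a file (for example appending > C) would give the third file the question asks for.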

How it works

This program implicitly loops through the lines of file A.

BEGIN{x=getline s <"B"; split(s,b,"\t")}

Before we start reading file A, read the first line of file B into string s. Split that string up into array b using tabs as the separator.

getline returns 1 each time it successfully reads a line, so x stays true until we run out of lines to read in file B.
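
In more detail, the form getline var < file returns 1 when a line was read, 0 at end of file, and -1 when the file cannot be read. Because -1 also counts as true in awk, a more defensive test would be (getline s < file) > 0. A small sketch of the three cases follows; the one-line file and the missing path are made up for the demonstration.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'S1_6370\tT/C\t1\t6370\t+\n' > "$tmp/B"   # a one-line stand-in for file B

codes=$(awk -v f="$tmp/B" -v missing="$tmp/no_such_file" 'BEGIN {
    r1 = (getline s < f)          # 1: a line was read into s
    r2 = (getline s < f)          # 0: end of file, s keeps its old value
    r3 = (getline t < missing)    # -1: the file could not be opened
    print r1, r2, r3
}')
echo "$codes"
```

This prints 1 0 -1.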

!x{exit}

If we have run out of lines to read in file B, then exit the program.

sub(/;.*/,"",$9)

Remove the first ; and everything after it from field 9 of file A, leaving only the ID=gene:... attribute.
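
As a quick check of this step, here is a minimal sketch that applies the same sub call to one column-9 value taken from the question. The shell variable names attrs and id are only for illustration.

```shell
#!/bin/sh
attrs='ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1'

# The value has no whitespace, so awk sees it as field 1.
id=$(printf '%s\n' "$attrs" | awk '{ sub(/;.*/, "", $1); print $1 }')
echo "$id"
```

This prints ID=gene:GRMZM5G888250.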

while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}

Loop through the lines of file B, printing the requested output as long as field 4 of the current line of file B lies strictly between fields 4 and 5 of the current line of file A.

The loop also stops once getline reports that file B has no more lines, because x then becomes false.

OFS='\t'

Make the output field separator a tab.
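
An assignment such as OFS='\t' placed before a file name on the command line takes effect before that file is read, but after the BEGIN block has run. print joins its comma-separated arguments with OFS. A minimal sketch comparing the default separator (a single space) with a tab; the input line and variable names are made up:

```shell
#!/bin/sh
line='a b c'

# Default OFS: fields are joined with one space.
with_space=$(printf '%s\n' "$line" | awk '{ print $1, $3 }')

# OFS set on the command line; the trailing - tells awk to read standard input.
with_tab=$(printf '%s\n' "$line" | awk '{ print $1, $3 }' OFS='\t' -)

printf '%s\n%s\n' "$with_space" "$with_tab"
```

The first line printed is a and c separated by a space, the second is a and c separated by a tab.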

Multi-line version

For those who prefer their awk code split over multiple lines:

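awk '
# Before reading A: load the first line of B into s and split it into b.
BEGIN{
    x=getline s <"B"
    split(s,b,"\t")
}

# Stop as soon as B has no more lines.
!x {
    exit
}

# For each line of A: trim field 9 to the ID, then print every B line
# whose position (b[4]) lies strictly between fields 4 and 5.
{
    sub(/;.*/,"",$9)
    while (x && $4<b[4] && b[4]<$5) {
        print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]
        x=getline s <"B"
        split(s,b,"\t")
    }
}
' OFS='\t' A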
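
One limitation worth knowing: the loop above only moves forward in B while the current position matches, so a B position that falls between two genes would stop all later matches from being found. The sketch below is one possible variant, not part of the original answer, that first skips B positions lying at or before the start of the current gene. It assumes both files are sorted by position, cover a single chromosome, and that genes do not overlap. The helper next_b, the variables more and bfile, and the S1_9700 line are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Two genes from the question (column 9 shortened for illustration).
printf '1\tgramene\tgene\t4854\t9652\t.\t-\t.\tID=gene:GRMZM2G059865;biotype=protein_coding\n'  >  "$tmp/A"
printf '1\tgramene\tgene\t9882\t10387\t.\t-\t.\tID=gene:GRMZM5G888250;biotype=protein_coding\n' >> "$tmp/A"

# Positions: S1_9700 is made up and falls between the two genes.
printf 'S1_6370\tT/C\t1\t6370\t+\n' >  "$tmp/B"
printf 'S1_9700\tG\t1\t9700\t+\n'   >> "$tmp/B"
printf 'S1_9889\tA\t1\t9889\t+\n'   >> "$tmp/B"

merged=$(awk -F'\t' -v OFS='\t' -v bfile="$tmp/B" '
function next_b() {                      # read the next line of B into array b
    more = ((getline s < bfile) > 0)     # false at end of file or on error
    if (more) split(s, b, "\t")
}
BEGIN { next_b() }
!more { exit }                           # B is exhausted: nothing left to match
{
    sub(/;.*/, "", $9)                   # keep only the ID=... attribute
    while (more && b[4] + 0 <= $4 + 0)   # skip positions at or before the gene start
        next_b()
    while (more && b[4] + 0 < $5 + 0) {  # positions strictly inside the gene
        print $1, $9, $4, $5, $7, b[1], b[2], b[4], b[5]
        next_b()
    }
}' "$tmp/A")
printf '%s\n' "$merged"
```

With this sample, S1_9700 is skipped and two lines are printed, one for S1_6370 in the first gene and one for S1_9889 in the second. Overlapping genes would need a different approach, since each B line is consumed only once.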

Edited at 2021-02-23
