Awk comparing 3 values, 2nd file value between 1st file values with multiple column printout between both files to a 3rd

ecog

I am trying to make a comparison between two large files, tab delimited. I have been trying to use awk & bash (Ubuntu 15.10), python (v3.5) and powershell (windows 10). My only background is Java but my field tends to stick with the scripting languages.

I am trying to see

File 1 A[ ]

1   gramene gene    4854    9652    .   -   .   ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:C0P8I2];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1   gramene gene    9882    10387   .   -   .   ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1   gramene gene    109519  111769  .   -   .   ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1   gramene gene    136307  138929  .   +   .   ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2 B [ ]

S1_6370 T/C 1   6370    +
S1_8210 T   1   8210    +
S1_8376 A   1   8376    +
S1_9889 A   1   9889    +

Output

1   ID=gene:GRMZM2G059865   4857    9652    -   S1_6370 T/C 6370    +   
1   ID=gene:GRMZM2G059865   4857    9652    -   S1_8210 T   8210    +
1   ID=gene:GRMZM2G059865   4857    9652    -   S1_8376 A   8376    +
1   ID=gene:GRMZM5G888250   9882    10387   -   S1_9889 A   9889    +

My general logic

loop (until end of A[ ] and B[ ])
if
B[$4]>A[$4] && B[$4]<A[$5]  #if the value in B column 4 is in between the values in A columns 4 & 5.
then
-F”\t” print {A[1], A[9(filtered)], A[$4FS$5], B[$1], B[$2], B[$3], B[$4], B[$5]}   #hopefully reflects awk column calls if the two files were able to have their columns defined that way.
movea++ # to see if the next set of B column 4 values is in between the values in A columns 4 & 5 
else
moveb++ #to see if the next set of A columns 4&5 values contain the current vales of B column 4 in them.

I know this logic doesn’t follow any language that I am aware of but is similar in parts. It seems like NR and FNR are two built in running values in awk. Awk helped me split up File 2 that had 10 values in B[$1] into 10 files quite easily and also cut helped with cutting out the few hundred columns (~255+) beyond the 5 you see here. Now I am working File 2 sizes around a couple MB instead of 1 file of 1.6 GB. Other than cutting down loading times, I wanted to simplify the loops. I haven’t backtracked to my previous attempts of python or powershell since I trimmed the file sizes down. I convinced myself they just weren’t going to read my files with their built in libraries or cmdlets. Which I’ll try sometime soon if I am unable to figure out an awk solution.

comparing multiple files and columns using awk #referenced Awk greater than less than but within a set range #referenced efficiently splitting one file into several files by value of column #the one thing that worked Using awk to get a specific string in line #might be able to filter column 9 How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix? #this seemed the closest but without all the printing out in a third file I wanted, still not able to figure out the syntax completely

John1024

Try:

$ awk 'BEGIN{x=getline s <"B"; split(s,b,"\t")} !x{exit} {sub(/;.*/,"",$9); while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}}' OFS='\t' A
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_6370 T/C     6370    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8210 T       8210    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8376 A       8376    +
1       ID=gene:GRMZM5G888250   9882    10387   -       S1_9889 A       9889    +

How it works

This program implicitly loops through the lines of file A.

  • BEGIN{x=getline s <"B"; split(s,b,"\t")}

    Before we start reading file A, read the first line of file B into string s. Split that string up into array b using tabs as the separator.

    The function getline will set x to true until we run out of lines to read in file B.

  • !x{exit}

    If we have run out of lines to read in file B, then exit the program.

  • sub(/;.*/,"",$9)

    Remove everything after the ; from field 9 of file A.

  • while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}

    Loop through the lines of file B, printing the requested output as long as the fourth field of line B is between the values of fields 4 and 5 of file A.

    The function getline will set x to true until we run out of lines to read in file B.

  • OFS='\t'

    Make the output field separator a tab.

Multi-line version

For those who prefer their awk code split over multiple lines:

awk '

BEGIN{
    x=getline s <"B"
    split(s,b,"\t")
} 

!x {
    exit
} 

{   
    sub(/;.*/,"",$9)
    while (x && $4<b[4] && b[4]<$5) {
        print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]
        x=getline s <"B"; split(s,b,"\t")
    }
}
' OFS='\t' A

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

Awk comparing 3 values, 2nd file value between 1st file values with multiple column printout between both files to a 3rd

From Dev

Merge multiple files: 1st Column (same string), 2nd Column (unique values per file)

From Dev

Merge multiple files: 1st Column (same string), 2nd Column (unique values per file)

From Dev

How to get the 3rd column in an array by using 1st and 2nd column values as index in Matlab

From Dev

awk compare 2 files, print match and nonmatch lines;3rd column of first file and 2nd column of second file

From Dev

Get the values from 2nd and 3rd table for the matched values in 1st table

From Dev

Compare 1st Column in 2 Files and Replace 3rd Column of File 1 with 4th Column of File 2

From Dev

Get a list of file name, and compare to a csv file while adding 2nd and 3rd column of the same 1st column

From Dev

AWK Retrieve text after a certain pattern where the 1st and 2nd columns match the values in the 1st and 2nd columns in an input file

From Dev

comparing values of 2 columns from same pandas dataframe & returning value of 3rd column based on comparison

From Dev

Given a CSV file, how do I delete the content between the 2nd and 3rd tabs of each row?

From Dev

Splitting a file into multiple files based on 1st column value

From Dev

Merge three file in to a single master file by excluding the header in 2nd and 3rd files

From Dev

Java:Three digit Sum - Find out all the numbers between 1 and 999 where the sum of 1st digit and 2nd digit is equal to 3rd digit

From Dev

Extract words between the 2nd and the 3rd comma

From Dev

How to search between the 2nd and 3rd delimiters

From Dev

Remove text between the 2nd and the 3rd colon

From Dev

Compare 1st column of the 1st file and 2nd column of the 2nd file and get the the output as following

From Dev

Compare 1st column of the 1st file and 2nd column of the 2nd file and matching

From Dev

merge 1st column and 2nd column; 3rd and 4th column and so on with a separator

From Dev

I want to add 2nd and 3rd column if 1st column within range of 1 to 10000

From Dev

comparing column values between two Unix files

From Dev

oracle sql query finding rows with multiple values in 3rd column matching columns 1 and 2

From Dev

In SQL, I need to generate a ranking (1st, 2nd, 3rd) column, getting stuck on "ties"

From Dev

In SQL, I need to generate a ranking (1st, 2nd, 3rd) column, getting stuck on "ties"

From Dev

VBA locating the 1st, 2nd, 3rd and 4th largest cells in a column for conditional formatting

From Dev

Awk - Match Values Between Two Files and Create a New File

From Dev

Find the average values in 2nd column for each distinct values in 1st column using Linux

From Dev

awk two files based on 1st & 2nd column

Related Related

  1. 1

    Awk comparing 3 values, 2nd file value between 1st file values with multiple column printout between both files to a 3rd

  2. 2

    Merge multiple files: 1st Column (same string), 2nd Column (unique values per file)

  3. 3

    Merge multiple files: 1st Column (same string), 2nd Column (unique values per file)

  4. 4

    How to get the 3rd column in an array by using 1st and 2nd column values as index in Matlab

  5. 5

    awk compare 2 files, print match and nonmatch lines;3rd column of first file and 2nd column of second file

  6. 6

    Get the values from 2nd and 3rd table for the matched values in 1st table

  7. 7

    Compare 1st Column in 2 Files and Replace 3rd Column of File 1 with 4th Column of File 2

  8. 8

    Get a list of file name, and compare to a csv file while adding 2nd and 3rd column of the same 1st column

  9. 9

    AWK Retrieve text after a certain pattern where the 1st and 2nd columns match the values in the 1st and 2nd columns in an input file

  10. 10

    comparing values of 2 columns from same pandas dataframe & returning value of 3rd column based on comparison

  11. 11

    Given a CSV file, how do I delete the content between the 2nd and 3rd tabs of each row?

  12. 12

    Splitting a file into multiple files based on 1st column value

  13. 13

    Merge three file in to a single master file by excluding the header in 2nd and 3rd files

  14. 14

    Java:Three digit Sum - Find out all the numbers between 1 and 999 where the sum of 1st digit and 2nd digit is equal to 3rd digit

  15. 15

    Extract words between the 2nd and the 3rd comma

  16. 16

    How to search between the 2nd and 3rd delimiters

  17. 17

    Remove text between the 2nd and the 3rd colon

  18. 18

    Compare 1st column of the 1st file and 2nd column of the 2nd file and get the the output as following

  19. 19

    Compare 1st column of the 1st file and 2nd column of the 2nd file and matching

  20. 20

    merge 1st column and 2nd column; 3rd and 4th column and so on with a separator

  21. 21

    I want to add 2nd and 3rd column if 1st column within range of 1 to 10000

  22. 22

    comparing column values between two Unix files

  23. 23

    oracle sql query finding rows with multiple values in 3rd column matching columns 1 and 2

  24. 24

    In SQL, I need to generate a ranking (1st, 2nd, 3rd) column, getting stuck on "ties"

  25. 25

    In SQL, I need to generate a ranking (1st, 2nd, 3rd) column, getting stuck on "ties"

  26. 26

    VBA locating the 1st, 2nd, 3rd and 4th largest cells in a column for conditional formatting

  27. 27

    Awk - Match Values Between Two Files and Create a New File

  28. 28

    Find the average values in 2nd column for each distinct values in 1st column using Linux

  29. 29

    awk two files based on 1st & 2nd column

HotTag

Archive