I have a data file A.tsv (field separator = \t):
id clade mutation
243 40A titi,toto,lala
254
267 40B lala,jiji,jojo
and a template file B.tsv (field separator = \t):
40A lala,toto,xixi,xaxa
40B xaxa,jojo,huhu
40C sasa,sisi,lala
Based on their common column (clade), I want to compare the mutations in A.tsv against the template B.tsv and write the number of matches found to a new column in a new file (C.tsv), like this:
id clade mutation number
243 40A titi,toto,lala 2
254
267 40B lala,jiji,jojo 1
I know how to compare two files like this:
awk -F"," -vOFS="," '
NR==FNR {
a[$2]=$3;
next
}
{ print $0,a[$2] }
' B.tsv A.tsv > C.tsv
but I don't know how to count the matches. Do you have an idea?
A SECOND QUESTION:
I'm wondering how to add a new column that only gives how many mutations B.tsv lists for that clade. Example for the column total_mut in C.tsv:
id clade mutation number total_mut
243 40A titi,toto,lala 2 4
254
267 40B lala,jiji,jojo 1 3
The method is to build an array indexed by clade and mutation from the B file, then iterate over the mutations in the A file.
Handling a tab-separated file is somewhat tricky, especially keeping the number of columns consistent on rows that have no clade.
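As a quick illustration of why the field separator matters here, the sketch below counts the fields in a row that has an id but no clade or mutation. It assumes such a row still carries its two tab separators, which the sample in the question does not make visible.

```shell
# Tab as the separator: the two empty fields are kept, so NF is 3.
printf '254\t\t\n' | awk -F'\t' '{ print NF }'

# Default separator: runs of whitespace are collapsed, so NF is 1.
printf '254\t\t\n' | awk '{ print NF }'
```

With the default separator the empty columns disappear, which is why the script sets FS to a tab explicitly.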
We define the column numbers needed for the A file as cClade and cMut; they are set here to match the full data format (for the three-column sample shown in the question they would be 2 and 3).
For the follow-up question, we save nMut (the number of mutations per clade), which split() already returns, and add it to both print statements (header and detail). This version has been tested too.
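To see the return value of split() in isolation, here is a minimal check using the 40A template line from the question; the array name V is just the one the script happens to use.

```shell
awk 'BEGIN {
    # split() fills V and returns how many pieces it produced.
    n = split("lala,toto,xixi,xaxa", V, ",")
    print n    # prints 4
}'
```

So no separate counting loop is needed for total_mut; the value only has to be stored per clade.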
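#! /bin/bash
Match () { #:: (data, template)
    Awk='
    # cClade and cMut are the clade and mutation column numbers in the data file.
    BEGIN { FS = "\t"; Sep = ","; cClade = 20; cMut = 41; }
    # Template file: store the mutation count and every clade/mutation pair.
    F == "B" {
        nMut[$1] = split ($2, V, Sep);
        for (j in V) Mut[$1 Sep V[j]];
        next;
    }
    # Data rows with no clade: append two empty columns.
    $cClade == "" { printf ("%s%s%s\n", $0, FS, FS); next; }
    # Header row: append the two new column names.
    FNR == 1 { printf ("%s%s%s%s%s\n", $0, FS, "number", FS, "total_mut"); next; }
    # Detail rows: count the mutations that also appear in the template.
    {
        n = 0;
        split ($cMut, V, Sep);
        for (j in V) if (($cClade Sep V[j]) in Mut) ++n;
        printf ("%s%s%s%s%s\n", $0, FS, n, FS, nMut[$cClade]);
    }
    '
    awk -f <( printf '%s' "${Awk}" ) F="B" "${2}" F="A" "${1}"
}
Match useTemplate.A.tsv useTemplate.B.tsv > useTemplate.C.tsv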
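For the three-column sample in the question, the same idea can be tried with the sketch below. It is a condensed variant, not the script above: it assumes the column numbers 2 and 3, assumes the row without a clade keeps its tab separators, uses the NR == FNR idiom from the question instead of the F variable, and writes to a temporary directory.

```shell
#!/bin/sh
# Recreate the sample files from the question (tab-separated).
dir=$(mktemp -d)
printf 'id\tclade\tmutation\n243\t40A\ttiti,toto,lala\n254\t\t\n267\t40B\tlala,jiji,jojo\n' > "$dir/A.tsv"
printf '40A\tlala,toto,xixi,xaxa\n40B\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > "$dir/B.tsv"

awk -F'\t' -v OFS='\t' -v cClade=2 -v cMut=3 '
NR == FNR {                          # first file: the template
    nMut[$1] = split($2, V, ",")     # mutations listed for this clade
    for (j in V) Mut[$1, V[j]]       # referencing the element creates the key
    next
}
FNR == 1      { print $0, "number", "total_mut"; next }   # header row
$cClade == "" { print $0, "", ""; next }                  # no clade: pad only
{
    n = 0
    split($cMut, V, ",")
    for (j in V) if (($cClade, V[j]) in Mut) n++          # count the matches
    print $0, n, nMut[$cClade]
}' "$dir/B.tsv" "$dir/A.tsv" > "$dir/C.tsv"

cat "$dir/C.tsv"
```

On the sample data this gives 2 and 4 for id 243, and 1 and 3 for id 267, matching the table requested in the question.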