I have an XML file like this:
<pr_id>01</pr_id>
<uniprot>O11482</uniprot>
<uniprot>O96642</uniprot>
<uniprot>Q67845</uniprot>
<column>
<column_id>1</column_id>
<column_start>300</column_start>
<column_end>334</column_end>
<old_new>old</old_new>
<comment></comment>
</column>
<column>
<column_id>2</column_id>
<column_start>335</column_start>
<column_end>337</column_end>
<old_new>new</old_new>
<comment></comment>
</column>
<pr_id>02</pr_id>
<uniprot>P4455</uniprot>
<uniprot>89WER8</uniprot>
<uniprot>Q12845</uniprot>
<column>
<column_id>1</column_id>
<column_start>12</column_start>
<column_end>34</column_end>
<old_new>old</old_new>
<comment></comment>
</column>
<column>
<column_id>2</column_id>
<column_start>35</column_start>
<column_end>37</column_end>
<old_new>old</old_new>
<comment></comment>
</column>
I want to get the following output:
pr_id uniprot old_start old_end
01 O11482 300 334
02 P4455 12 34
02 P4455 35 37
What is a simple way to achieve this? This is my first time working with XML files. Any suggestions would be greatly appreciated!
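With GNU Awk version 4, you can use the split() function, whose fourth argument (a gawk extension) captures the matched separators:
gawk -f a.awk file.xml
where a.awk is: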
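# Read the whole file as one record: "^$" never matches a non-empty file (gawk only).
BEGIN { RS = "^$" }
{
    # Split on the tags of interest; the matched tags themselves are stored in seps.
    n = split($0, a, /<\/?(uniprot|pr_id|column_start|column_end|old_new)>/, seps)
    # For odd i, seps[i] is an opening tag and a[i+1] is the text it encloses.
    for (i = 1; i <= n - 1; i += 2) {
        if (seps[i] == "<pr_id>") { pp = a[i+1]; up = 0 }
        # Keep only the first uniprot of each pr_id.
        if (seps[i] == "<uniprot>" && up == 0) { uu = a[i+1]; up = 1 }
        if (seps[i] == "<column_start>") ss = a[i+1]
        if (seps[i] == "<column_end>") ee = a[i+1]
        # Record a row only for columns marked "old".
        if (seps[i] == "<old_new>" && a[i+1] == "old") {
            p[++k] = pp
            u[k] = uu
            s[k] = ss
            e[k] = ee
        }
    }
}
END {
    fmt = "%5s%10s%10s%10s\n"
    printf fmt, "pr_id", "uniprot", "old_start", "old_end"
    for (i = 1; i <= k; i++)
        printf fmt, p[i], u[i], s[i], e[i]
}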
Output:
pr_id   uniprot old_start   old_end
   01    O11482       300       334
   02     P4455        12        34
   02     P4455        35        37
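For readers without gawk, the same split-on-tags idea can be sketched in Python. This is only an illustration, not part of the original answer: `extract_old_columns` and `SAMPLE` are names invented here, and `SAMPLE` is a shortened copy of the question's data. Like the awk script, it assumes the tags of interest are well formed and strictly alternate between opening and closing. `re.split` with a capturing group plays the role of gawk's separator array, because it keeps the matched tags in the result list.

```python
import re

# Tags of interest, mirroring the regular expression used in a.awk.
TAG = re.compile(r"(</?(?:uniprot|pr_id|column_start|column_end|old_new)>)")


def extract_old_columns(text):
    """Return (pr_id, first uniprot, start, end) for every column marked "old"."""
    # With a capturing group, re.split keeps the matched tags, so the list is
    # [text, tag, text, tag, ...] and the tags sit at the odd indexes.
    parts = TAG.split(text)
    rows = []
    pr_id = uniprot = start = end = None
    for i in range(1, len(parts) - 1, 2):
        tag, value = parts[i], parts[i + 1]
        if tag == "<pr_id>":
            pr_id, uniprot = value, None  # new entry: forget the previous uniprot
        elif tag == "<uniprot>" and uniprot is None:
            uniprot = value  # keep only the first uniprot of each pr_id
        elif tag == "<column_start>":
            start = value
        elif tag == "<column_end>":
            end = value
        elif tag == "<old_new>" and value == "old":
            rows.append((pr_id, uniprot, start, end))
    return rows


# Hypothetical sample data, shortened from the question.
SAMPLE = """<pr_id>01</pr_id>
<uniprot>O11482</uniprot>
<uniprot>O96642</uniprot>
<column>
<column_start>300</column_start>
<column_end>334</column_end>
<old_new>old</old_new>
</column>
<column>
<column_start>335</column_start>
<column_end>337</column_end>
<old_new>new</old_new>
</column>
"""

for row in extract_old_columns(SAMPLE):
    # Same column widths as the fmt string in a.awk.
    print("%5s%10s%10s%10s" % row)
```

Closing tags also land at odd indexes, but they match none of the branches and are skipped. If the file is guaranteed to be valid XML, a real parser such as `xml.etree.ElementTree` would probably be more robust, although the fragment shown in the question has no single root element and would need to be wrapped in one first.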