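I am working with FASTQ files from next-generation sequencing data, organized as in the example below.
I want to remove all reads whose number in the fifth position of the first line (the fifth colon-separated field of the read header) falls within a given interval.
Here is an example in which the reads from 1101 to 1103 should be removed. Input: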
@ST-E00204:114:HHKTJALXX:4:1101:22962:1538_1:N:0:1/1
NGTGTTTTTAATTATTAAGTTATTTTTTTAGTTTTTTAAGGATTTTTATAGTAGTAATAGAAATTTAATTAAGATAGAAAATTTTAAGTGTGGTTAGGATTGTAGTTTTGTTGGTATTATGTTGATTTAGTATAAGTAAAGTTTTGATTTT
+
AAAAJJJJFJJAJJJJAJAJJJJJJJJJAJJ-FJJJJF - FJJJJFJJJFFJJJFFJ-JJJJFFFFJ-AJ7AJJJJJJJJJJFJJJJFJFAJFFJJF-AAFAJFJJ7AJAJJFJFJJJ7FFFFFJFJJ-7F-77AJF - 7FJJ
@ST-E00204:114:HHKTJALXX:4:1102:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAATTATTCTCATTTTCCATAACTTCCAATTTT
+
!AA-FJJJJJJJ-FJAJFJJJJFJJAJJJJJ-F-AJJJJJ-F-FJJFJJFJFFJFFFF <-F <FJJJF- <7 <JF <-7AAFFJ-A <A77--7FAAF-A ---- 7F -7-<F7 --- 77 <---- 7 - << FA --- 7 << --- --- 7
@ST-E00204:114:HHKTJALXX:4:1103:7141:2012 2:N:0:1
NAAAACATAAAATATAACAAACAAACTAAAAATCATAAAAAATAAAAAACATCCACTTAACAACTTAAAAAATAACAAAATCACTAATTATAATAAAAAATAAAAAATACACACTCTAACACCTAAAACACAACCAAAAAACTAAAACTCC
+
!AAFFFFJJJJA-是F-AFFJJ-是F <JJF <AJFJ <JF-7 <JJAA7- J-FFFJ7JJJFJ-是F <AJJJFFJ-是AF-是AJ <FF-JFFF-77 <- -JJ Smuts,777 <7 ---- 7-A <JA-7 << FFF <-7--7-FFFF-<--- 7 --- 7A- <A7FA ------ 7 -
@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAATAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAAATTATTCTCATTTTCCATAACTTCCAATTTT
+
!AA-FJJJJJJJ-FJAJFJJJJJFJJAJJJJJJ-F-AJJJJJF-FJJFJJFJFFFFF <-F <FJJJF- <7 <JF <-7AAFFJ-A <A77--7FAAF-A ---- 7FF-7-7 <FJ < A-7-<F7 --- 77 <---- 7-<< FA --- 7 << --- 7 ---
Desired output:
@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAATAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAATTATTCTCATTTTCCATAACTTCCAATTTT
+
!AA-FJJJJJJ-FJAJFJJJJJFJJAJJJJJJ-F-F-AJJJJJJ FJJFJJFJFFFFF <-F <FJJJF- <7 <JF <-7AAFFJ - A <A77--7FAAF-A ---- 7FF-7-7 <FJ <A-7-<F7 --- 77 <---- 7-<< FA --- 7 <<---- 7 ---
One idea is to use:
split -l 4 myfile.fq
and then delete each resulting file according to the number in the fifth position, for example:
grep -v ":1104"
grep -v ":1105"
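and so on, but the problem is that the files are very large. I also need to remove larger intervals, for example from 1000 to 2000, and each number corresponds to many reads.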
awk solution:
awk -F':' -v RS="@" 'NR>1 && ($5<1101 || $5>1103){ print RS$0 }' myfile.fq
Output:
@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAATAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAATTATTCTCATTTTCCATAACTTCCAATTTT
+
!A-A-FJJJJJJ-FJAJFJJJJJFJJAJJJJJJ-F-AJJJJJJ-F-FJJFJJFJFFFFF<-F
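Details:
-F':' - sets the field separator to :
-v RS="@" - treats @ as the record separator, so each read becomes one record
NR>1 - skips the empty record that precedes the first @
($5<1101 || $5>1103) - keeps a record only when its fifth field lies outside 1101 to 1103, which implements "remove from 1101 to 1103"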
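
One caveat with RS="@": in the common Phred+33 encoding, @ is also a valid quality character, so a quality line that contains it would be cut into a spurious record. The following is a minimal sketch of an alternative that reads the file in groups of four lines instead. It assumes every read occupies exactly four lines (no wrapped sequences), and the function name filter_fastq and the variables lo and hi are hypothetical names chosen for illustration, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): drop every read whose
# fifth colon-separated header field lies in the closed interval [lo, hi].
# Assumption: strict 4-line FASTQ records (header, sequence, +, quality).
filter_fastq() {
    lo=$1
    hi=$2
    shift 2
    awk -F':' -v lo="$lo" -v hi="$hi" '
        # Line 1 of each 4-line record is the header; decide once per read.
        NR % 4 == 1 { keep = ($5 + 0 < lo + 0 || $5 + 0 > hi + 0) }
        # Print the header and the three lines after it only for kept reads.
        keep
    ' "$@"
}
```

It could be called as filter_fastq 1101 1103 myfile.fq > filtered.fq, and a wider interval such as 1000 to 2000 only changes the two numbers. Because awk streams the input line by line, the file does not need to be split first.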