Delete some lines based on a sequence number

融合坡度

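I am working with fastq files from next-generation sequencing data, which are organized as follows:

  • Line 1 holds information about the sequencer, lane, tile, and read number
  • Line 2 holds the sequence
  • Line 3 is a + symbol used as a separator
  • Line 4 holds the read quality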

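I want to delete all reads whose number in the fifth position of the first line (the fifth colon-separated field, e.g. 1101) falls within a specific interval.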
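To see which values actually occur in that position before choosing an interval, one option is to tally the fifth colon-separated field of every header line. This is only a sketch: it assumes a strict four-lines-per-read layout with no blank lines between records, and the temporary sample file and its shortened reads are made up for illustration.

```shell
# Build a tiny two-read FASTQ file for illustration (contents are invented).
sample=$(mktemp)
cat > "$sample" <<'EOF'
@ST-E00204:114:HHKTJALXX:4:1101:22962:1538 1:N:0:1
ACGT
+
AAFJ
@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
TTGA
+
JJFA
EOF

# Header lines are lines 1, 5, 9, ... so NR % 4 == 1 selects them.
# With ':' as the field separator, $5 is the number to filter on.
tiles=$(awk -F':' 'NR % 4 == 1 { print $5 }' "$sample" | sort -n | uniq -c)
echo "$tiles"
```

Each output line is a count followed by the value, which gives a rough idea of how many reads a given interval would remove.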

Here is an example of reads, where those from 1101 to 1103 should be deleted. Input:

@ST-E00204:114:HHKTJALXX:4:1101:22962:1538_1:N:0:1/1
NGTGTTTTTAATTATTAAGTTATTTTTTTAGTTTTTTAAGGATTTTTATAGTAGTAATAGAAATTTAATTAAGATAGAAAATTTTAAGTGTGGTTAGGATTGTAGTTTTGTTGGTATTATGTTGATTTAGTATAAGTAAAGTTTTGATTTT 
+ 
AAAAJJJJFJJAJJJJAJAJJJJJJJJJAJJ-FJJJJF - FJJJJFJJJFFJJJFFJ-JJJJFFFFJ-AJ7AJJJJJJJJJJFJJJJFJFAJFFJJF-AAFAJFJJ7AJAJJFJFJJJ7FFFFFJFJJ-7F-77AJF - 7FJJ 


@ST-E00204:114:HHKTJALXX:4:1102:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAATTATTCTCATTTTCCATAACTTCCAATTTT 
+
!AA-FJJJJJJJ-FJAJFJJJJFJJAJJJJJ-F-AJJJJJ-F-FJJFJJFJFFJFFFF <-F <FJJJF- <7 <JF <-7AAFFJ-A <A77--7FAAF-A ---- 7F -7-<F7 --- 77 <---- 7 - << FA --- 7 << --- --- 7 

@ST-E00204:114:HHKTJALXX:4:1103:7141:2012 2:N:0:1
NAAAACATAAAATATAACAAACAAACTAAAAATCATAAAAAATAAAAAACATCCACTTAACAACTTAAAAAATAACAAAATCACTAATTATAATAAAAAATAAAAAATACACACTCTAACACCTAAAACACAACCAAAAAACTAAAACTCC 
+, AAFFFFJJJJA-是F-AFFJJ-是F <JJF <AJFJ <JF-7 <JJAA7- J-FFFJ7JJJFJ-是F <AJJJFFJ-是AF-是AJ <FF-JFFF-77 <- -JJ Smuts,777 <7 ---- 7-A <JA-7 << FFF <-7--7-FFFF-<--- 7 --- 7A- <A7FA ------ 7 - 

@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAATAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAAATTATTCTCATTTTCCATAACTTCCAATTTT
+ 
!AA-FJJJJJJJ-FJAJFJJJJJFJJAJJJJJJ-F-AJJJJJF-FJJFJJFJFFFFF <-F <FJJJF- <7 <JF <-7AAFFJ-A <A77--7FAAF-A ---- 7FF-7-7 <FJ < A-7-<F7 --- 77 <---- 7-<< FA --- 7 << --- 7 ---

Desired output:

@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAATAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAATTATTCTCATTTTCCATAACTTCCAATTTT 
+ 
!AA-FJJJJJJ-FJAJFJJJJJFJJAJJJJJJ-F-F-AJJJJJJ FJJFJJFJFFFFF <-F <FJJJF- <7 <JF <-7AAFFJ - A <A77--7FAAF-A ---- 7FF-7-7 <FJ <A-7-<F7 --- 77 <---- 7-<< FA --- 7 <<---- 7 ---

One idea is to use:

split -l 4 myfile.fq

and then delete each resulting file according to the number in the fifth position, for example:

grep -v ":1104"
grep -v ":1105"

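and so on, but the problem is that the files are very large. I also have to delete larger intervals, for example from 1000 to 2000, and each number corresponds to many reads.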

RomanPerekhrest

awk solution:

awk -F':' -v RS="@" 'NR>1 && ($5<1101 || $5>1103){ print RS$0 }' myfile.fq

Output:

@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
NATTTAAAAATACCCACTATAAAACATAAAATATAACAAAAAAACTAAAAATCATAAAAAATAAAAAAAATCCACTTCACGTCTTTTAACAATTTCGTCATTTTTAACATCCTCAAATAAATTATTCTCATTTTCCATAACTTCCAATTTT
+
!A-A-FJJJJJJ-FJAJFJJJJJFJJAJJJJJJ-F-AJJJJJJ-F-FJJFJJFJFFFFF<-F

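Details

  • -F':' - use : as the field separator

  • -v RS="@" - treat @ as the record separator

  • ($5<1101 || $5>1103) - keep a record only if its fifth field lies outside the interval to delete (1101 to 1103)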
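The record-separator approach relies on @ appearing only at the start of header lines. In Phred+33 encoded files, @ is also a valid quality character, so a line-based variant may be safer. The following is a minimal sketch rather than the answer author's method: it assumes strict four-line records with no blank lines between them, and `lo`, `hi`, and the temporary sample file are names and data chosen here for illustration.

```shell
# Build a tiny two-read FASTQ file whose quality lines contain '@' (invented data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
@ST-E00204:114:HHKTJALXX:4:1101:22962:1538 1:N:0:1
ACGT
+
AA@J
@ST-E00204:114:HHKTJALXX:4:1104:7101:2012 2:N:0:1
TTGA
+
@JFA
EOF

# lo and hi (hypothetical names) give the inclusive interval of reads to drop.
filtered=$(awk -F':' -v lo=1101 -v hi=1103 '
    NR % 4 == 1 { t = $5 + 0; keep = (t < lo || t > hi) }  # decide once per header line
    keep                                                   # print every line of a kept read
' "$sample")
echo "$filtered"
```

For the larger interval mentioned in the question, the same command would be run with -v lo=1000 -v hi=2000. awk streams the input line by line, so the file does not need to be split first.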
