I have a tab-delimited table. Its first three lines, a header row and the first two entries, look like this:
Geneid Chr Start End Strand Length Feature_count contig_ID MAG_id RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein G1_719_cleanedcontig_v2_1580 346495 347049 + 555 68733 NODE_28_length_349332_cov_12.741083 ag0r3_bin.39 11455.58033225708
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein G1_719_cleanedcontig_v2_1582 147164 151051 - 3888 61026 NODE_113_length_189623_cov_11.186889 ag0r3_bin.39 1451.8890393965803
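For each row, I want to extract the text between "ID=" and the first semicolon (for example, "G1_719_cleanedcontig_v2_1580_319" for the first entry) and put it in a new column on the right. How can I do this with Bash, Python, or a combination of the two?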
Assuming the data frame is
text
0 Geneid Chr Start End Strand Length Featur...
1 ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...
2 ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...
simply extract the characters between ID= and the first ;
df['newcolumn'] = df['text'].str.extract(r'(?<=ID=)(.*?)(?=;)', expand=False)
text \
0 Geneid Chr Start End Strand Length Featur...
1 ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...
2 ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...
newcolumn
0 NaN
1 G1_719_cleanedcontig_v2_1580_319
2 G1_719_cleanedcontig_v2_1582_130
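The answer above assumes each whole line sits in a single `text` column. For the tab-delimited file in the question, the same "between ID= and the first semicolon" rule can be applied to the Geneid field directly, with the result appended as a new rightmost column. The following is a minimal sketch that uses only Python's standard library (`csv` and `re`) rather than pandas. The function name `add_id_column`, the new column name `gene_id`, and the shortened `SAMPLE` rows are assumptions made for illustration, not part of the original post.

```python
import csv
import io
import re

# Captures the text between "ID=" and the first semicolon.
ID_PATTERN = re.compile(r"ID=([^;]+)")

# Hypothetical sample: the two rows from the question, with the long
# attribute string in the Geneid field shortened for readability.
SAMPLE = (
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tFeature_count\tcontig_ID\tMAG_id\tRPKM\n"
    "ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;"
    "product=hypothetical protein\tG1_719_cleanedcontig_v2_1580\t346495\t347049\t+\t555\t"
    "68733\tNODE_28_length_349332_cov_12.741083\tag0r3_bin.39\t11455.58033225708\n"
    "ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;"
    "product=hypothetical protein\tG1_719_cleanedcontig_v2_1582\t147164\t151051\t-\t3888\t"
    "61026\tNODE_113_length_189623_cov_11.186889\tag0r3_bin.39\t1451.8890393965803\n"
)


def add_id_column(tsv_text, source_col="Geneid", new_col="gene_id"):
    """Return tsv_text with an extra rightmost column holding the extracted ID."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")

    header = next(reader)
    src = header.index(source_col)  # position of the column that holds "ID=..."
    writer.writerow(header + [new_col])

    for row in reader:
        if not row:  # skip blank lines
            continue
        match = ID_PATTERN.search(row[src])
        # Leave the new field empty when a row has no "ID=" entry.
        writer.writerow(row + [match.group(1) if match else ""])
    return out.getvalue()


result = add_id_column(SAMPLE)
print(result)
```

To process a real file, the same function could be fed the contents of that file and the returned string written back out. The pattern `[^;]+` stops at the first semicolon, so the later `locus_tag=` field is never captured.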