I have multiple blocks of data like the one below:
chr1.trna4 (17188416-17188486) Length: 71 bp
Type: Gly Anticodon: CCC at 33-35 (17188448-17188450) Score: 78.3
HMM Sc=56.60 Sec struct Sc=21.70
* | * | * | * | * | * | * |
Seq: GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.
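For each block, I need to find the 8th pattern in the last line of the block, the one beginning with Str:. In the case above, the 8th pattern is ....... (7 periods). This is because the first group of > symbols forms the first pattern, the following group of periods forms the second pattern, and so on.
Now I need to extract the 7 characters at the same positions from the Seq: line, which sits directly above the Str: line. In this example that corresponds to the subsequence CTCCCAC.
The output should be: Seq is CTCCCAC and Anticodon: CCC
Is this possible in bash, or in any other shell?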
More examples of data blocks:
chr19.trna11 (4724719-4724647) Length: 73 bp
Type: Val Anticodon: CAC at 34-36 (4724686-4724684) Score: 79.2
HMM Sc=49.10 Sec struct Sc=30.10
* | * | * | * | * | * | * |
Seq: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA
Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chr19.trna12 (1383433-1383361) Length: 73 bp
Type: Phe Anticodon: GAA at 34-36 (1383400-1383398) Score: 88.9
HMM Sc=68.40 Sec struct Sc=20.50
* | * | * | * | * | * | * |
Seq: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chr21.trna1 (18827177-18827107) Length: 71 bp
Type: Gly Anticodon: GCC at 33-35 (18827145-18827143) Score: 80.9
HMM Sc=60.10 Sec struct Sc=20.80
* | * | * | * | * | * | * |
Seq: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.
chrX.trna4 (18693101-18693029) Length: 73 bp
Type: Val Anticodon: TAC at 34-36 (18693068-18693066) Score: 82.9
HMM Sc=54.70 Sec struct Sc=28.20
* | * | * | * | * | * | * |
Seq: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA
Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chrX.trna6 (3833344-3833271) Length: 74 bp
Type: Ile Anticodon: GAT at 35-37 (3833310-3833308) Score: 75.5
HMM Sc=50.20 Sec struct Sc=25.30
* | * | * | * | * | * | * |
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chrX.trna8 (3794915-3794842) Length: 74 bp
Type: Ile Anticodon: GAT at 35-37 (3794881-3794879) Score: 75.5
HMM Sc=50.20 Sec struct Sc=25.30
* | * | * | * | * | * | * |
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chrX.trna10 (3756491-3756418) Length: 74 bp
Type: Ile Anticodon: GAT at 35-37 (3756457-3756455) Score: 75.5
HMM Sc=50.20 Sec struct Sc=25.30
* | * | * | * | * | * | * |
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chr19.trna8 (45981945-45981859) Length: 87 bp
Type: SeC Anticodon: TCA at 36-38 (45981910-45981908) Score: 146.9
HMM Sc=0.00 Sec struct Sc=0.00
* | * | * | * | * | * | * | * | *
Seq: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG
Str: >>>>>>>.>..>>>>>>....<<<<<<<<<<<<.......<<<<<<.>>>>>....<<<<<.>>>>.......<<<<<.<<<<<<<.
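Given that we can extract the start index together with the anticodon (this assumes the stretch you want always begins 2 bases before the anticodon and is always 7 bases long):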
len=7      # number of bases to extract
prior=2    # the stretch starts this many bases before the anticodon
while IFS= read -r line; do
    if [[ $line =~ Anticodon:" "([[:alpha:]]+)" at "([0-9]+) ]]; then
        anticodon=${BASH_REMATCH[1]}
        start=$(( BASH_REMATCH[2] - 1 ))   # string indexing is zero-based
    elif [[ $line == "Seq: "* ]]; then
        seq=${line#Seq: }
        printf "Seq: %s, Anticodon: %s\n" "${seq:start-prior:len}" "$anticodon"
    fi
done < file
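To make the index arithmetic concrete, here is a small worked check on the first block from the question. It is only a sketch: it assumes, as the loop above does, that the 7-base stretch starts two bases before the reported anticodon position. The variable names anticodon_pos, offset and loop are introduced just for this illustration.

```bash
#!/usr/bin/env bash
# Worked example for the first block, whose header says "Anticodon: CCC at 33-35".
seq='GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA'
anticodon_pos=33   # 1-based start position, as reported in the block
prior=2            # assumption: the stretch starts 2 bases before the anticodon
len=7              # assumption: the stretch is 7 bases long

start=$(( anticodon_pos - 1 ))   # 32: convert to a zero-based index
offset=$(( start - prior ))      # 30: step back over the 2 leading bases
loop=${seq:offset:len}           # bash substring expansion: ${var:offset:length}

printf 'Seq: %s\n' "$loop"       # prints "Seq: CTCCCAC"
```

The anticodon CCC sits at zero-based indexes 32 to 34, so it lands in the middle of the extracted 7 characters.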
A more complex solution, which parses the "Str:" line each time and so does not hard-code the length as 7 (it does hard-code the "nth" pattern):
8thSeq() {
    local seq=$1 str=$2
    local last=${str:0:1}
    local nth=8 n=1 start i
    for (( i = 1; i < ${#str}; i++ )); do
        # a change of character marks the start of a new run
        if [[ "${str:i:1}" != "$last" ]]; then
            ((n++))
            if ((n == nth)); then
                start=$i                      # the nth run begins here
            elif ((n == nth+1)); then
                echo "${seq:start:i-start}"   # the nth run ended at i-1
                break
            fi
        fi
        last=${str:i:1}
    done
}
while IFS= read -r line; do
    if [[ $line =~ Anticodon:" "([[:alpha:]]+) ]]; then
        anticodon=${BASH_REMATCH[1]}
    elif [[ $line == "Seq: "* ]]; then
        seq=${line#Seq: }
    elif [[ $line == "Str: "* ]]; then
        # the Str: line closes the block, so everything needed is known here
        str=${line#Str: }
        printf "Seq: %s, Anticodon: %s\n" "$(8thSeq "$seq" "$str")" "$anticodon"
    fi
done < file
With the "more examples" data, both solutions output:
Seq: CTCACAC, Anticodon: CAC
Seq: CTGAAGA, Anticodon: GAA
Seq: CTGCCAC, Anticodon: GCC
Seq: TTTACAC, Anticodon: TAC
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTTCAAA, Anticodon: TCA
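If the run number should not be hard-coded either, the same scan can take it as an argument. The following is a minimal sketch, not one of the two solutions above. nth_run is a hypothetical helper name, and it assumes the Seq and Str strings are aligned character for character. It also covers the case where the requested run is the last one on the line, and returns a non-zero status when the line has fewer runs than requested.

```bash
#!/usr/bin/env bash
# nth_run SEQ STR N
# Print the part of SEQ that lines up with the N-th run of identical
# characters in STR. Returns 1 if STR contains fewer than N runs.
nth_run() {
    local seq=$1 str=$2 nth=$3
    local i n=1 start=0
    for (( i = 1; i <= ${#str}; i++ )); do
        # A run ends where the character changes, or at the end of STR.
        if (( i == ${#str} )) || [[ ${str:i:1} != "${str:i-1:1}" ]]; then
            if (( n == nth )); then
                printf '%s\n' "${seq:start:i-start}"
                return 0
            fi
            (( n++ ))
            start=$i
        fi
    done
    return 1
}

# Usage with the first block from the question:
seq='GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA'
str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'
nth_run "$seq" "$str" 8    # prints CTCCCAC
```

Inside the read loop, the call "$(8thSeq "$seq" "$str")" could then be swapped for "$(nth_run "$seq" "$str" 8)".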