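I want to get the maximum, minimum, and average sequence length from a GenBank file that contains several different organisms. I want these values reported separately for every organism.
For example: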
Organism: Homo sapiens
average length = 160
shortest length = 20 | id of shortest seq
longest length = 500 | id of longest seq
Organism: Caenorhabditis elegans
average length = 140
shortest length = 40 | id of shortest seq
longest length = 300 | id of longest seq
I have worked out the lengths, but I cannot separate them by organism:
use strict;
use warnings;
print "enter file path: ";
my $filename = <>;
chomp ($filename);
open(IN, $filename) or die "\n error opening file \n;/\n";
$/ = "//";
my %organisms ;
while (my $block = <IN>) {
next if $block =~ /^\s*\n\s*$/;
my ($definition , $sequence) = split "ORIGIN", $block;
my $accession;
$definition =~ m/(ACCESSION.+[0-9])/x
? $accession = $1
: die "No ACCESSION";
my $organism;
$definition =~ m/(ORGANISM\s+.+\n)/x
? $organism = $1
: die "No ORGANISM";
$sequence =~ s/[\d\n\s\t\/]//g;
$organisms{ $sequence } = [ $organism, $accession ];
}
my $sum = 0;
foreach my $sequence( keys %organisms) {
my $current_len = length($sequence);
$sum += $current_len;
};
my $number_seqs = scalar keys %organisms;
my $average = ($sum / $number_seqs);
print "average length = $average \n";
my @sorted_keys = sort { length $a <=> length $b } keys %organisms ;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
my $short = length $shortest;
my $long = length $longest;
my $short_id = $organisms{$shortest}->[1];
my $long_id = $organisms{$longest}->[1];
my $short_type = $organisms{$shortest}->[0];
my $long_type = $organisms{$longest}->[0];
print "shortest length = $short | $short_id | $short_type\n";
print "longest length = $long | $long_id | $long_type\n";
exit;
Which part should I change to print the lengths for each organism?
Sample input:
LOCUS NM_001112 40 bp mRNA linear PRI 20-AP
DEFINITION Homo sapiens transcript variant 5, mRNA.
ACCESSION NM_001112
KEYWORDS RefSeq.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
ORIGIN
1 actgggcggc ccttagaccc
//
LOCUS NM_854 212 bp mRNA linear PRI 20-AP
DEFINITION Homo sapiens transcript variant 1, mRNA.
ACCESSION NM_854
KEYWORDS RefSeq.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
ORIGIN
1 gggcaaaag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag
//
LOCUS AW057463 469 bp mRNA linear EST 29-SE
DEFINITION ca03d09.x1 C elegans fem3 Q23 S1 Caenorhabditis elegans cDNA 3'
ACCESSION AW057463
VERSION AW057463.1 GI:5933102
DBLINK BioSample: LIBEST_002392
KEYWORDS EST.
SOURCE Caenorhabditis elegans
ORGANISM Caenorhabditis elegans
ORIGIN
1 ttttactcaa aactatctat ccaagttaat cagtagtgtt agttctagtt aagttattaa
61 ggcgcacggt ctgtctcctt gcttcttctc tttgtatccc ctttctcctt tttcaaaact
121 tcactttcat caataattgg ttctttagaa tacagttttc caatttccac gtactctctt
181 ctcttccgat ccttgtcaaa ctttttcttc gggagctcat cttctggaac tactttcaca
//
LOCUS AW04463 259 bp mRNA linear EST 14-SE
DEFINITION ca02d86.x1 C elegans fem6 S12 Q3 Caenorhabditis elegans cDNA 3'
ACCESSION AW04463
VERSION AW04463.1 GI:90872
DBLINK BioSample: LIBEST_004372
KEYWORDS EST.
SOURCE Caenorhabditis elegans
ORGANISM Caenorhabditis elegans
ORIGIN
241 tttttcgatg gaaccaaacg ggaacgagtt ggcttttcca ccaaaagatt agcgtactcc
301 gaactgtatt tccccttctt tttcttttca agaggaacat tttctcgttg agtatcatcg
361 tcctccaaac tttgttgagt agtcatggac tgggtccgag agaattcaac ggtaggcatg
421 gaacctttgc tcttgtcgtc gtttgccttt ggtgcctttc ccttttgaa
//
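I think a large part of the program needs to change, mainly the logic that uses the sequence as the hash key. With that layout, two records with identical sequences overwrite each other, and there is no way to group records by organism. Use the organism as the hash key instead. Here is a suggestion: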
use feature qw(say);
use strict;
use warnings;
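my $filename = 'datafile.txt';
# Three-argument open with a lexical filehandle; $! reports why the open failed
open(my $in, '<', $filename) or die "Error opening file '$filename': $!\n";
# GenBank records end with "//", so read the file one record at a time
$/ = "//";
my %organisms ;
while (my $block = <$in>) {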
next if $block =~ /^\s*\n\s*$/;
my ($definition , $sequence) = split "ORIGIN", $block;
my $accession;
$definition =~ m/ACCESSION\s*(\S+)/x
? $accession = $1
: die "No ACCESSION";
my $organism;
$definition =~ m/ORGANISM\s*(\S.+)\n/x
? $organism = $1
: die "No ORGANISM";
$sequence =~ s/[\d\n\s\t\/]//g;
push @{ $organisms{ $organism }{info} }, [ $sequence, $accession ];
}
for my $organism ( keys %organisms ) {
say "Organism: $organism";
my $info = $organisms{ $organism }{info};
my $sum = 0;
$sum += length $_->[0] for @$info;
my ( $min_len, $min_id, $max_len, $max_id ) = map { length $_->[0], $_->[1] }
(sort { length $a->[0] <=> length $b->[0]} @$info)[0,-1];
say "average length = " . $sum / ( scalar @$info );
say "shortest length = $min_len | $min_id";
say "longest length = $max_len | $max_id";
say "";
}
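This produces the following output (the order of the organisms may vary, because Perl does not guarantee hash key order):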
Organism: Homo sapiens
average length = 39.5
shortest length = 20 | NM_001112
longest length = 59 | NM_854
Organism: Caenorhabditis elegans
average length = 234.5
shortest length = 229 | AW04463
longest length = 240 | AW057463
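If the file is large, it may be worth avoiding the sort and not keeping every sequence in memory. The following is only a sketch of that idea, not part of the suggestion above: it keeps a running count, sum, minimum, and maximum per organism in a single pass. The helper name `organism_stats`, the `TOY_*` accessions, and the embedded sample records are made up for illustration, and the regexes assume that `ACCESSION`, `ORGANISM`, and `ORIGIN` each start a line (optionally indented).

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical helper: takes the full text of a GenBank file and returns
# a hash reference keyed by organism name with running length statistics.
sub organism_stats {
    my ($text) = @_;
    my %stats;

    # Records are terminated by a line that contains only "//".
    for my $block ( split m{^//[ \t]*$}m, $text ) {
        next unless $block =~ /\S/;    # skip the empty chunk after the last "//"

        # Everything after the ORIGIN line is sequence data.
        my ( $header, $sequence ) = split /^[ \t]*ORIGIN.*$/m, $block, 2;
        next unless defined $sequence;

        my ($accession) = $header =~ /^[ \t]*ACCESSION\s+(\S+)/m
            or die "No ACCESSION";
        my ($organism) = $header =~ /^[ \t]*ORGANISM\s+(\S.*?)[ \t]*$/m
            or die "No ORGANISM";

        # Drop position numbers and whitespace; only the letters count.
        $sequence =~ s/[^A-Za-z]//g;
        my $len = length $sequence;

        # Update the running statistics for this organism.
        my $s = $stats{$organism} //= { count => 0, sum => 0 };
        $s->{count}++;
        $s->{sum} += $len;
        if ( !defined $s->{min_len} || $len < $s->{min_len} ) {
            @$s{qw(min_len min_id)} = ( $len, $accession );
        }
        if ( !defined $s->{max_len} || $len > $s->{max_len} ) {
            @$s{qw(max_len max_id)} = ( $len, $accession );
        }
    }
    return \%stats;
}

# Made-up miniature records, used only to exercise the helper.
my $sample = <<'END';
LOCUS       TOY_1
ACCESSION   TOY_1
  ORGANISM  Homo sapiens
ORIGIN
        1 actgggcggc ccttagaccc
//
LOCUS       TOY_2
ACCESSION   TOY_2
  ORGANISM  Homo sapiens
ORIGIN
        1 gggcaaaag aagcaggtca
//
LOCUS       TOY_3
ACCESSION   TOY_3
  ORGANISM  Caenorhabditis elegans
ORIGIN
        1 ttttactcaa aactatctat ccaagttaat
//
END

my $stats = organism_stats($sample);

# Sorting the keys makes the report order reproducible between runs.
for my $organism ( sort keys %$stats ) {
    my $s = $stats->{$organism};
    say "Organism: $organism";
    say "average length = ", $s->{sum} / $s->{count};
    say "shortest length = $s->{min_len} | $s->{min_id}";
    say "longest length = $s->{max_len} | $s->{max_id}";
    say "";
}
```

To run it on a real file, read the whole file into a string and pass that to `organism_stats` in place of `$sample`. When two records tie for shortest or longest, this sketch reports the one that appears first in the file.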