Extract FASTA sequences from a file based on sequence IDs
I have two files. seq.fasta contains FASTA sequences, and ids.txt contains the IDs of the sequences to be extracted from seq.fasta.
For example:
seq.fasta
>AUP4056.1
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLI
>XIM5213
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKD
>bcna2598.1
GPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCLPTNDDTHLKSEGQK
ids.txt
AUP4056.1 bcna2598.1 YUP42568 CAD42579.3
JIK6023.5 ZNB708645
I tried the following program, but it just copies the whole seq.fasta file to the output.
Perl code
#!/usr/bin/env perl
use strict;
use warnings;

open ( my $id_file, '<', 'ids.txt' ) or die $!;
#use split here, to split any lines on whitespace.
chomp ( my @ids = map { split } <$id_file> );
close ( $id_file );

my %sequences;
open ( my $input, '<', 'seq.fasta' ) or die $!;
{
    local $/ = ''; #paragraph mode; Read until blank line
    while ( <$input> ) {
        my ( $id, $sequence ) = m/>\s*(\S+)\n(.*)/ms;
        $sequences{$id} = $sequence;
    }
}

foreach my $id ( @ids ) {
    if ( $sequences{$id} ) {
        print ">$id\n";
        print "$sequences{$id}\n";
    }
}
close ($input);
exit;
Can anyone tell me where I went wrong?
Update:
I want to store the output in a separate file.
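The code you're using is written for FASTA files that have a blank line between sequences. Yours has none, so it fails: with $/ = '' (paragraph mode) the whole file is read as a single record, the regex captures the first ID together with everything after it as its "sequence", and because that first ID appears in ids.txt the entire file is printed.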
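To make the failure concrete, here is a minimal sketch of what paragraph mode does with input that has no blank lines. The two shortened records and the in-memory filehandle are assumptions made for illustration only; they are not the asker's real data.

```perl
use strict;
use warnings;

# Hypothetical sample: two FASTA records with no blank line between them.
my $fasta = ">AUP4056.1\nMFKSLIQ\n>XIM5213\nFKISSKG\n";

# Open the string as a filehandle so the sketch needs no files on disk.
open my $fh, '<', \$fasta or die "Cannot open in-memory sample: $!";

my @records;
{
    local $/ = '';    # paragraph mode: a record ends only at a blank line
    @records = <$fh>;
}
close $fh;

# With no blank lines, the whole input comes back as one record.
printf "records read: %d\n", scalar @records;

# The regex from the question then captures the first ID and
# everything that follows it, later headers included.
my ( $id, $sequence ) = $records[0] =~ m/>\s*(\S+)\n(.*)/ms;
print "id: $id\n";
print "captured sequence still contains a header line\n"
    if $sequence =~ /^>/m;
```

Since the single captured ID is one of the wanted IDs, printing its "sequence" reproduces the rest of the file.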
The following should work instead, although I haven't been able to test it:
use strict;
use warnings 'all';

# Build a lookup hash of the wanted IDs (split on any whitespace)
my %ids = do {
    open my $fh, '<', 'ids.txt'
        or die qq{Unable to open "ids.txt" for input: $!};
    local $/;    # slurp the whole file in one read
    map { $_ => undef } split ' ', <$fh>;
};

open my $fh, '<', 'seq.fasta'
    or die qq{Unable to open "seq.fasta" for input: $!};

my $print;
while ( <$fh> ) {
    # On each header line, decide whether this record is wanted
    $print = exists $ids{$1} if /^>(\S+)/;
    print if $print;
}
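For the update in the question (storing the output in a separate file), the same line-by-line approach can print to an output filehandle instead of STDOUT. What follows is only a sketch under assumptions: the helper name extract_fasta and the output name selected.fasta are made up for illustration, and reporting the IDs that were never found is an extra that the question did not ask for.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the records whose ID is listed in $ids_path
# from $fasta_path into $out_path. Returns the wanted IDs never seen.
sub extract_fasta {
    my ( $ids_path, $fasta_path, $out_path ) = @_;

    # Slurp the ID file and split on any whitespace.
    open my $ids_fh, '<', $ids_path
        or die qq{Unable to open "$ids_path" for input: $!};
    my %wanted = map { $_ => 0 } split ' ', do { local $/; <$ids_fh> // '' };
    close $ids_fh;

    open my $in, '<', $fasta_path
        or die qq{Unable to open "$fasta_path" for input: $!};
    open my $out, '>', $out_path
        or die qq{Unable to open "$out_path" for output: $!};

    my $print = 0;
    while ( my $line = <$in> ) {
        # A header line switches printing on or off for the record it starts.
        if ( $line =~ /^>(\S+)/ ) {
            $print = exists $wanted{$1};
            $wanted{$1}++ if $print;
        }
        print {$out} $line if $print;
    }
    close $in;
    close $out or die qq{Unable to close "$out_path": $!};

    my @missing = sort grep { !$wanted{$_} } keys %wanted;
    return @missing;
}

# Run only when both input files from the question are present.
if ( -e 'ids.txt' and -e 'seq.fasta' ) {
    my @missing = extract_fasta( 'ids.txt', 'seq.fasta', 'selected.fasta' );
    warn "No sequence found for: @missing\n" if @missing;
}
```

Records are written in the order they appear in seq.fasta, not in the order of ids.txt. If redirecting is enough, running the answer's script with its standard output redirected to a file gives the same records without any code change.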