如何使用 Perl 脚本从 FASTA 文件中的匹配字符串中提取 ID?
How to extract IDs from a matched string in FASTA files using Perl script?
我是 Perl 编程的新手,一直坚持使用我的脚本。
我正在尝试在 FASTA 文件中搜索基序,如果找到,打印出包含该基序的蛋白质的 ID。
我可以加载我的文件,但是在放置我的主题之后,没有任何反应。
我收到以下错误:Use of uninitialized value $data[0] in concatenation (.) or string at test.pl line 36, <STDIN> line 2.
这是我的代码:
#!/usr/bin/perl -w
# Searching for motifs
print "Please type the filename of the protein sequence data: ";
$proteinfilename = <STDIN>;
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless ( open(FA, $proteinfilename) ) {
print "Cannot open file \"$proteinfilename\"\n\n";
exit;
}
@protein = <FA>; # Read the protein sequence data from the file, and store it into the array variable @protein
my (@description, @ID, @data);
while (my $protein = <FA>) {
chomp($protein);
@description = split (/\s+/, $protein);
push (@ID, $description[0]);
}
# Close the file
close FA;
my %params = map { $_ => 1 } @ID;
# Put the protein sequence data into a single string, as it's easier to search for a motif in a string than in an array of lines
$protein = join( '', @protein);
# Remove whitespace
$protein =~ s/\s//g;
# ask for a motif or exit if no motif is entered.
do {
print "Enter a motif to search for: ";
$motif = <STDIN>;
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
@data = split (/\s+/, $protein);
if ( $protein =~ /$motif/ ) {
print $description[0]."\n" if(exists($params{$data[0]}));
}
# exit on an empty user input
} until ( $motif =~ /^\s*$/ );
# exit the program
exit;
输入的例子是:
>sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2
MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP
RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
假设我想在给定序列中找到基序 PMET
。如果它存在,我想得到一个 ID 作为输出 -> O60341
我在这里为单行输入文件编码作为示例。
my $motif = <STDIN>;
chomp($motif);
my $str = "sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2 MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY";
if($str=~m/$motif/)
{
if($str=~m/^([^|]+)\|([^|]+)\|/gm)
{
print "Expected Value: \n";
}
}
else { print "Not matched...\n"; }
Input>$: PMET
Expected Value: O60341
如果您在 perl 中使用 fasta 文件,请帮自己一个忙,使用 BioPerl 而不是尝试滚动您自己的 reader 代码。让事情变得更简单、更清晰。
#!/usr/bin/env perl
use strict;
use warnings; # Prefer this instead of the -w switch
use feature qw/say/;
use Bio::SeqIO;
my $reader = Bio::SeqIO->new(-format => "fasta",
# Use -file => "foo.fasta" to read from a file instead.
-fh => \*DATA);
my $motif = "PMET";
while (my $seq = $reader->next_seq) {
if ($seq->seq =~ /\Q$motif/) {
my $id = (split /\|/, $seq->primary_id)[1];
say "Matched: $id";
}
}
__DATA__
>sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2
MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP
RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
以下代码匹配 $motif
并输出 $id
use strict;
use warnings;
use feature 'say';
my $motif = 'PMET';
while(<DATA>) {
next unless /$motif/;
my $id = (split '\|')[1];
say "Found: $id";
}
__DATA__
sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2 MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
输出
Found: O60341
如果您想提取更详细的信息,请执行以下代码执行此功能
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $debug = 1;
my $motif = 'PMET';
while(<DATA>) {
next unless /$motif/;
my $record = fill_record($_);
say 'Found: ' . $record->{id};
say Dumper($record) if $debug;
}
sub fill_record {
my $data = shift;
my(%record,@dna);
(@record{qw/x id/}, $data) = split '\|', $data;
($data,@dna) = $data =~ /(.*?) (\S+) (\S+)$/;
$record{dna} = \@dna;
$data =~ /(.*?) \w+=/;
$record{desc} = ;
%{$record{group}} = $data =~ /\b(\w+)=([^=]+) (?=\w+=)/g;
return \%record;
}
__DATA__
sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2 MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
输出
Found: O60341
$VAR1 = {
'x' => 'sp',
'id' => 'O60341',
'group' => {
'OX' => '9606',
'OS' => 'Homo sapiens',
'GN' => 'KDM1A',
'PE' => '1'
},
'desc' => 'KDM1A_HUMAN Lysine-specific histone demethylase 1A',
'dna' => [
'MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP',
'RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY'
]
};
我是 Perl 编程的新手,一直坚持使用我的脚本。
我正在尝试在 FASTA 文件中搜索基序,如果找到,打印出包含该基序的蛋白质的 ID。
我可以加载我的文件,但是在放置我的主题之后,没有任何反应。
我收到以下错误:Use of uninitialized value $data[0] in concatenation (.) or string at test.pl line 36, <STDIN> line 2.
这是我的代码:
#!/usr/bin/perl -w
# Searching for motifs
print "Please type the filename of the protein sequence data: ";
$proteinfilename = <STDIN>;
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless ( open(FA, $proteinfilename) ) {
print "Cannot open file \"$proteinfilename\"\n\n";
exit;
}
@protein = <FA>; # Read the protein sequence data from the file, and store it into the array variable @protein
my (@description, @ID, @data);
while (my $protein = <FA>) {
chomp($protein);
@description = split (/\s+/, $protein);
push (@ID, $description[0]);
}
# Close the file
close FA;
my %params = map { $_ => 1 } @ID;
# Put the protein sequence data into a single string, as it's easier to search for a motif in a string than in an array of lines
$protein = join( '', @protein);
# Remove whitespace
$protein =~ s/\s//g;
# ask for a motif or exit if no motif is entered.
do {
print "Enter a motif to search for: ";
$motif = <STDIN>;
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
@data = split (/\s+/, $protein);
if ( $protein =~ /$motif/ ) {
print $description[0]."\n" if(exists($params{$data[0]}));
}
# exit on an empty user input
} until ( $motif =~ /^\s*$/ );
# exit the program
exit;
输入的例子是:
>sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2
MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP
RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
假设我想在给定序列中找到基序 PMET
。如果它存在,我想得到一个 ID 作为输出 -> O60341
我在这里为单行输入文件编码作为示例。
my $motif = <STDIN>;
chomp($motif);
my $str = "sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2 MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY";
if($str=~m/$motif/)
{
if($str=~m/^([^|]+)\|([^|]+)\|/gm)
{
print "Expected Value: \n";
}
}
else { print "Not matched...\n"; }
Input>$: PMET
Expected Value: O60341
如果您在 perl 中使用 fasta 文件,请帮自己一个忙,使用 BioPerl 而不是尝试滚动您自己的 reader 代码。让事情变得更简单、更清晰。
#!/usr/bin/env perl
use strict;
use warnings; # Prefer this instead of the -w switch
use feature qw/say/;
use Bio::SeqIO;
my $reader = Bio::SeqIO->new(-format => "fasta",
# Use -file => "foo.fasta" to read from a file instead.
-fh => \*DATA);
my $motif = "PMET";
while (my $seq = $reader->next_seq) {
if ($seq->seq =~ /\Q$motif/) {
my $id = (split /\|/, $seq->primary_id)[1];
say "Matched: $id";
}
}
__DATA__
>sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2
MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP
RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
以下代码匹配 $motif
并输出 $id
use strict;
use warnings;
use feature 'say';
my $motif = 'PMET';
while(<DATA>) {
next unless /$motif/;
my $id = (split '\|')[1];
say "Found: $id";
}
__DATA__
sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2 MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
输出
Found: O60341
如果您想提取更详细的信息,请执行以下代码执行此功能
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $debug = 1;
my $motif = 'PMET';
while(<DATA>) {
next unless /$motif/;
my $record = fill_record($_);
say 'Found: ' . $record->{id};
say Dumper($record) if $debug;
}
sub fill_record {
my $data = shift;
my(%record,@dna);
(@record{qw/x id/}, $data) = split '\|', $data;
($data,@dna) = $data =~ /(.*?) (\S+) (\S+)$/;
$record{dna} = \@dna;
$data =~ /(.*?) \w+=/;
$record{desc} = ;
%{$record{group}} = $data =~ /\b(\w+)=([^=]+) (?=\w+=)/g;
return \%record;
}
__DATA__
sp|O60341|KDM1A_HUMAN Lysine-specific histone demethylase 1A OS=Homo sapiens OX=9606 GN=KDM1A PE=1 SV=2 MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY
输出
Found: O60341
$VAR1 = {
'x' => 'sp',
'id' => 'O60341',
'group' => {
'OX' => '9606',
'OS' => 'Homo sapiens',
'GN' => 'KDM1A',
'PE' => '1'
},
'desc' => 'KDM1A_HUMAN Lysine-specific histone demethylase 1A',
'dna' => [
'MLSGKKAAAAAAAAAAAATGTEAGPGTAGGSENGSEVAAQPAGLSGPAEVGPGAVGERTP',
'RKKEPPRASPPGGLAEPPGSAGPQAGPTVVPGSATPMETGIAETPEGRRTSRRKRAKVEY'
]
};