将 FASTA 文件读入哈希
Read FASTA file into hash
我正在尝试将 FASTA 文件读入哈希。
问题
my $primer_seq_14bp = q(); // --> works
my $primer_seq_14bp; // --> error
Use of uninitialized value $primer_seq_14bp in hash element at ./script.pl line 24, <FWDPRIMER> line 1.
为什么这个变量需要声明为空字符串?
这是我的第一个“我自己的”脚本的一部分,欢迎提供有关格式、效率等方面的反馈!
代码
#!/usr/bin/perl
use strict; use warnings;
my %primer_name_for;
my $primer_name;
my $primer_seq_14bp;
# Store fwd primer sequences in a sequence{name} hash.
open FWDPRIMER, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
while (<FWDPRIMER>) {
chomp;
if ( /^(>.+)/ ) {
$primer_name = ;
}
else {
$primer_seq = $_;
$primer_seq_14bp = substr $primer_seq, 0, 14;
}
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
close FWDPRIMER;
输入
Fasta 文件,格式为:
>Sequence_name
DNA_sequence
>Sequence_name
DNA_sequence
等等
整个文件都是这种格式,所以我想我可以使用 $_
因为如果该行不是名称,它必须是一个序列。
目标
一个散列,其中 DNA_sequence
是 Sequence_name
的键。最终的行动是用读取序列搜索哈希以获得读取的生物学名称(读取带有默认名称)。
例子
输入底漆
>snp_fwd_primer
AAGCTCCTGCAGGTCATCTC
输入读取
>read_name
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG
期望的输出
请注意,读取的前 20 bp 与引物序列匹配。该脚本将读取序列剪切为 14 bp,因为这是我最短引物序列的长度。
>snp_fwd_primer
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG
更新
我有一个解决方案,但是下面的答案要好得多。我正在为其他迷路的生物学家分享它(我的解决方案基于生物信息学论坛)。
我更改了记录分隔符 $/
以一次读取输入入门 fasta 一个条目(我总是将 multi-line fasta 转换为 single-line;对于 multi-fasta 请参阅@ikegami 的回答)。然后使用正则表达式将 header 从序列中分离出来。主要缺点是我必须从输入入门 fasta 中手动删除第一个 >
(了解到您无法在 perl 脚本中关闭并重新打开同一文件)。
这不适用于输入 read fasta,因为我只想要序列行,所以我跳过了以 >
.
开头的任何行
#!/usr/bin/perl
# Usage: thisScript.pl fwdPrimers.fasta reads.fasta > reads_with_primer_names.fasta
# Fasta files need to be in single-line format.
# Before running, remove the first instance of '>' from the fwdPrimerFastaFile!
# Purpose: for reads that have marker forward primer sequences, rename the read with the marker name.
use strict; use warnings;
open PRIMERFILE, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
$/ = '>';
# Store fwd primer sequences in a sequence{name} hash.
my %primer_name_for;
while (<PRIMERFILE>) {
/(.+)\n(.+)\n/ and my $primer_name = and my $primer_seq = ;
my $primer_seq_14bp = substr $primer_seq, 0, 14;
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
close PRIMERFILE;
# Store reads in a full_length_read{trimmed_read} hash.
# Use trimmed read to search the fwd primer hash.
# Combine primer name and full length read.
open READFILE, '<', $ARGV[1] or die "error reading $ARGV[1]\n";
$/ = "\n";
my %read_fullSeq_for;
my $read_seq;
while (<READFILE>) {
chomp;
if($_ =~ /^>/){
next;
}
else {$read_seq = $_};
my $read_seq_14bp = substr $read_seq, 0, 14;
$read_fullSeq_for{$read_seq_14bp} = $read_seq;
if (exists $primer_name_for{$read_seq_14bp} and $read_fullSeq_for{$read_seq_14bp}) {
print ">$primer_name_for{$read_seq_14bp}\n$read_fullSeq_for{$read_seq_14bp}\n";
}
}
close READFILE;
__END__
假设第一行匹配 /^(>.+)/
。然后你最终做
$primer_name_for{undef} = $primer_name;
因为您从未将任何东西分配给
$primer_seq_14bp
您应该只在有序列时才添加到散列中。
my %primer_name_for;
while (<>) {
chomp;
if (/^>(.*)/s) {
$primer_name = ;
} else {
my $primer_seq = $_;
my $primer_seq_14bp = substr($primer_seq, 0, 14);
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
}
一个问题。序列可以跨越多行。要解决此问题,我们将在遇到标题或 EOF 时添加到哈希中。
my (%primer_name_for, $name, $seq);
while (<>) {
chomp;
if (/^>(.*)/s) {
$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
$name = $_;
$seq = '';
} else {
$seq .= $_;
}
}
$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
请调查以下示例代码是否符合您的问题。
我的理解是 OP 试图建立反向查找 table 序列名称作为哈希值,基于部分 DNA 序列作为哈希键。
注意:如果有一些示例输入数据和所需的输出将有助于澄清问题
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $length = 14;
my %sequence = do { local $/; <DATA> =~ />(.*?)\n([^>]*)/gs};
$sequence{$_} =~ s/\n//g for keys %sequence;
my %lookup_table = map { substr($sequence{$_},0,$length) => $_ } keys %sequence;
say Dumper(\%sequence);
say Dumper(\%lookup_table);
exit 0;
__DATA__
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
输出
$VAR1 = {
'SEQUENCE_2' => 'SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH',
'SEQUENCE_1' => 'MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL'
};
$VAR1 = {
'MTEITAAMVKELRE' => 'SEQUENCE_1',
'SATVSEINSETDFV' => 'SEQUENCE_2'
};
我正在尝试将 FASTA 文件读入哈希。
问题
my $primer_seq_14bp = q(); // --> works
my $primer_seq_14bp; // --> error
Use of uninitialized value $primer_seq_14bp in hash element at ./script.pl line 24, <FWDPRIMER> line 1.
为什么这个变量需要声明为空字符串?
这是我的第一个“我自己的”脚本的一部分,欢迎提供有关格式、效率等方面的反馈!
代码
#!/usr/bin/perl
use strict; use warnings;
my %primer_name_for;
my $primer_name;
my $primer_seq_14bp;
# Store fwd primer sequences in a sequence{name} hash.
open FWDPRIMER, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
while (<FWDPRIMER>) {
chomp;
if ( /^(>.+)/ ) {
$primer_name = ;
}
else {
$primer_seq = $_;
$primer_seq_14bp = substr $primer_seq, 0, 14;
}
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
close FWDPRIMER;
输入
Fasta 文件,格式为:
>Sequence_name
DNA_sequence
>Sequence_name
DNA_sequence
等等
整个文件都是这种格式,所以我想我可以使用 $_
因为如果该行不是名称,它必须是一个序列。
目标
一个散列,其中 DNA_sequence
是 Sequence_name
的键。最终的行动是用读取序列搜索哈希以获得读取的生物学名称(读取带有默认名称)。
例子
输入底漆
>snp_fwd_primer
AAGCTCCTGCAGGTCATCTC
输入读取
>read_name
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG
期望的输出
请注意,读取的前 20 bp 与引物序列匹配。该脚本将读取序列剪切为 14 bp,因为这是我最短引物序列的长度。
>snp_fwd_primer
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG
更新
我有一个解决方案,但是下面的答案要好得多。我正在为其他迷路的生物学家分享它(我的解决方案基于生物信息学论坛)。
我更改了记录分隔符 $/
以一次读取输入入门 fasta 一个条目(我总是将 multi-line fasta 转换为 single-line;对于 multi-fasta 请参阅@ikegami 的回答)。然后使用正则表达式将 header 从序列中分离出来。主要缺点是我必须从输入入门 fasta 中手动删除第一个 >
(了解到您无法在 perl 脚本中关闭并重新打开同一文件)。
这不适用于输入 read fasta,因为我只想要序列行,所以我跳过了以 >
.
#!/usr/bin/perl
# Usage: thisScript.pl fwdPrimers.fasta reads.fasta > reads_with_primer_names.fasta
# Fasta files need to be in single-line format.
# Before running, remove the first instance of '>' from the fwdPrimerFastaFile!
# Purpose: for reads that have marker forward primer sequences, rename the read with the marker name.
use strict; use warnings;
open PRIMERFILE, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
$/ = '>';
# Store fwd primer sequences in a sequence{name} hash.
my %primer_name_for;
while (<PRIMERFILE>) {
/(.+)\n(.+)\n/ and my $primer_name = and my $primer_seq = ;
my $primer_seq_14bp = substr $primer_seq, 0, 14;
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
close PRIMERFILE;
# Store reads in a full_length_read{trimmed_read} hash.
# Use trimmed read to search the fwd primer hash.
# Combine primer name and full length read.
open READFILE, '<', $ARGV[1] or die "error reading $ARGV[1]\n";
$/ = "\n";
my %read_fullSeq_for;
my $read_seq;
while (<READFILE>) {
chomp;
if($_ =~ /^>/){
next;
}
else {$read_seq = $_};
my $read_seq_14bp = substr $read_seq, 0, 14;
$read_fullSeq_for{$read_seq_14bp} = $read_seq;
if (exists $primer_name_for{$read_seq_14bp} and $read_fullSeq_for{$read_seq_14bp}) {
print ">$primer_name_for{$read_seq_14bp}\n$read_fullSeq_for{$read_seq_14bp}\n";
}
}
close READFILE;
__END__
假设第一行匹配 /^(>.+)/
。然后你最终做
$primer_name_for{undef} = $primer_name;
因为您从未将任何东西分配给
$primer_seq_14bp
您应该只在有序列时才添加到散列中。
my %primer_name_for;
while (<>) {
chomp;
if (/^>(.*)/s) {
$primer_name = ;
} else {
my $primer_seq = $_;
my $primer_seq_14bp = substr($primer_seq, 0, 14);
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
}
一个问题。序列可以跨越多行。要解决此问题,我们将在遇到标题或 EOF 时添加到哈希中。
my (%primer_name_for, $name, $seq);
while (<>) {
chomp;
if (/^>(.*)/s) {
$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
$name = $_;
$seq = '';
} else {
$seq .= $_;
}
}
$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
请调查以下示例代码是否符合您的问题。
我的理解是 OP 试图建立反向查找 table 序列名称作为哈希值,基于部分 DNA 序列作为哈希键。
注意:如果有一些示例输入数据和所需的输出将有助于澄清问题
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $length = 14;
my %sequence = do { local $/; <DATA> =~ />(.*?)\n([^>]*)/gs};
$sequence{$_} =~ s/\n//g for keys %sequence;
my %lookup_table = map { substr($sequence{$_},0,$length) => $_ } keys %sequence;
say Dumper(\%sequence);
say Dumper(\%lookup_table);
exit 0;
__DATA__
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
输出
$VAR1 = {
'SEQUENCE_2' => 'SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH',
'SEQUENCE_1' => 'MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL'
};
$VAR1 = {
'MTEITAAMVKELRE' => 'SEQUENCE_1',
'SATVSEINSETDFV' => 'SEQUENCE_2'
};