将 FASTA 文件读入哈希

Read FASTA file into hash

我正在尝试将 FASTA 文件读入哈希。

问题

my $primer_seq_14bp = q(); // --> works
my $primer_seq_14bp; // --> error
Use of uninitialized value $primer_seq_14bp in hash element at ./script.pl line 24, <FWDPRIMER> line 1.

为什么这个变量需要声明为空字符串?

这是我的第一个“我自己的”脚本的一部分,欢迎提供有关格式、效率等方面的反馈!

代码

#!/usr/bin/perl
use strict; use warnings;

my %primer_name_for;
my $primer_name;
my $primer_seq_14bp;

# Store fwd primer sequences in a sequence{name} hash.
open FWDPRIMER, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
while (<FWDPRIMER>) {
    chomp;
    if ( /^(>.+)/ ) {
        $primer_name = ;
    }
    else {
        $primer_seq = $_;
        $primer_seq_14bp = substr $primer_seq, 0, 14;
    }
    $primer_name_for{$primer_seq_14bp} = $primer_name;
}
close FWDPRIMER;

输入

Fasta 文件,格式为:

>Sequence_name
DNA_sequence
>Sequence_name
DNA_sequence

等等

整个文件都是这种格式,所以我想我可以使用 $_ 因为如果该行不是名称,它必须是一个序列。

目标

一个散列,其中 DNA_sequenceSequence_name 的键。最终的行动是用读取序列搜索哈希以获得读取的生物学名称(读取带有默认名称)。

例子

输入底漆

>snp_fwd_primer
AAGCTCCTGCAGGTCATCTC

输入读取

>read_name
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG

期望的输出

请注意,读取的前 20 bp 与引物序列匹配。该脚本将读取序列剪切为 14 bp,因为这是我最短引物序列的长度。

>snp_fwd_primer
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG

更新

我有一个解决方案,但是下面的答案要好得多。我正在为其他迷路的生物学家分享它(我的解决方案基于生物信息学论坛)。

我更改了记录分隔符 $/ 以一次读取输入入门 fasta 一个条目(我总是将 multi-line fasta 转换为 single-line;对于 multi-fasta 请参阅@ikegami 的回答)。然后使用正则表达式将 header 从序列中分离出来。主要缺点是我必须从输入入门 fasta 中手动删除第一个 >(了解到您无法在 perl 脚本中关闭并重新打开同一文件)。

这不适用于输入 read fasta,因为我只想要序列行,所以我跳过了以 >.

开头的任何行
#!/usr/bin/perl
# Usage: thisScript.pl fwdPrimers.fasta reads.fasta > reads_with_primer_names.fasta
# Fasta files need to be in single-line format.
# Before running, remove the first instance of '>' from the fwdPrimerFastaFile!
# Purpose: for reads that have marker forward primer sequences, rename the read with the marker name.

use strict; use warnings;

open PRIMERFILE, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
$/ = '>';

# Store fwd primer sequences in a sequence{name} hash.
my %primer_name_for;

while (<PRIMERFILE>) {
    /(.+)\n(.+)\n/ and my $primer_name =  and my $primer_seq = ;
    my $primer_seq_14bp = substr $primer_seq, 0, 14;
    $primer_name_for{$primer_seq_14bp} = $primer_name;
}
close PRIMERFILE;

# Store reads in a full_length_read{trimmed_read} hash. 
# Use trimmed read to search the fwd primer hash. 
# Combine primer name and full length read.
open READFILE, '<', $ARGV[1] or die "error reading $ARGV[1]\n";
$/ = "\n";

my %read_fullSeq_for;
my $read_seq;

while (<READFILE>) {
    chomp;
    if($_ =~ /^>/){ 
        next;
    }
    else {$read_seq = $_};
    my $read_seq_14bp = substr $read_seq, 0, 14;
    $read_fullSeq_for{$read_seq_14bp} = $read_seq;
    if (exists $primer_name_for{$read_seq_14bp} and $read_fullSeq_for{$read_seq_14bp}) {
        print ">$primer_name_for{$read_seq_14bp}\n$read_fullSeq_for{$read_seq_14bp}\n";
    }
}
close READFILE;

__END__

假设第一行匹配 /^(>.+)/。然后你最终做

$primer_name_for{undef} = $primer_name;

因为您从未将任何东西分配给

$primer_seq_14bp

您应该只在有序列时才添加到散列中。

my %primer_name_for;
while (<>) {
    chomp;
    if (/^>(.*)/s) {
        $primer_name = ;
    } else {
        my $primer_seq = $_;
        my $primer_seq_14bp = substr($primer_seq, 0, 14);
        $primer_name_for{$primer_seq_14bp} = $primer_name;
    }
}

一个问题。序列可以跨越多行。要解决此问题,我们将在遇到标题或 EOF 时添加到哈希中。

my (%primer_name_for, $name, $seq);
while (<>) {
    chomp;
    if (/^>(.*)/s) {
        $primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
        $name = $_;
        $seq = '';
    } else {
        $seq .= $_;
    }
}

$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);

请调查以下示例代码是否符合您的问题。

我的理解是 OP 试图建立反向查找 table 序列名称作为哈希值,基于部分 DNA 序列作为哈希键。

注意:如果有一些示例输入数据和所需的输出将有助于澄清问题

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $length = 14;
my %sequence = do { local $/; <DATA> =~ />(.*?)\n([^>]*)/gs};

$sequence{$_} =~ s/\n//g for keys %sequence;

my %lookup_table = map { substr($sequence{$_},0,$length) => $_ } keys %sequence;

say Dumper(\%sequence);
say Dumper(\%lookup_table);

exit 0;

__DATA__
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

输出

$VAR1 = {
          'SEQUENCE_2' => 'SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH',
          'SEQUENCE_1' => 'MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL'
        };

$VAR1 = {
          'MTEITAAMVKELRE' => 'SEQUENCE_1',
          'SATVSEINSETDFV' => 'SEQUENCE_2'
        };