Add new hash keys and then print in a new file

Question

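Previously, I posted a question looking for a way to match specific sequence identifiers (IDs) with a regular expression. Now I'm looking for advice on printing the data I'm after. If you want to see the complete files, here's a GitHub link.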

The script needs two files to work. The first file looks like this (only part of the file is shown):

AGY29650_2_NA   netOGlyc-4.0.0.13       CARBOHYD        2       2       0.0804934       .       .       
AGY29650_2_NA   netOGlyc-4.0.0.13       CARBOHYD        4       4       0.0925522       .       .       
AGY29650_2_NA   netOGlyc-4.0.0.13       CARBOHYD        13      13      0.0250116       .       .       
AGY29650_2_NA   netOGlyc-4.0.0.13       CARBOHYD        23      23      0.565981        .       .      
...

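This file tells me when a value is >= 0.5; that value is in the sixth column. When that happens, my script takes the first column (an ID that matches an ID in the second file) and the fourth column (the position of a letter in the second file).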
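The selection described above can be sketched on its own, without the FASTA file. This is only a minimal illustration: the rows are copied from the excerpt, and the names `select_hits` and `$threshold` are hypothetical, not part of the original script.

```perl
use strict;
use warnings;

# Sample rows copied from the excerpt above (whitespace-separated columns).
my @lines = (
    "AGY29650_2_NA   netOGlyc-4.0.0.13       CARBOHYD        13      13      0.0250116       .       .       ",
    "AGY29650_2_NA   netOGlyc-4.0.0.13       CARBOHYD        23      23      0.565981        .       .      ",
);

my $threshold = 0.5;

# Return [id, position] pairs for rows whose sixth column meets the threshold.
sub select_hits {
    my ($threshold, @rows) = @_;
    my @hits;
    for my $row (@rows) {
        my @cols = split ' ', $row;   # split on whitespace, ignoring leading/trailing blanks
        next unless @cols >= 6;       # skip rows too short to carry a score
        my ($id, $position, $score) = @cols[0, 3, 5];
        push @hits, [$id, $position] if $score >= $threshold;
    }
    return @hits;
}

my @hits = select_hits($threshold, @lines);
print "$_->[0]\t$_->[1]\n" for @hits;
```

With these two rows, only the row for position 23 passes the filter.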

This is my second file (only part of it is shown):

>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*

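As I said before, when the ID in the first file matches the ID in the second file, the script takes that ID and then looks up the position (fourth column) in the sequence data.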
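Since the two files write the same ID differently, the comparison needs a conversion step first. Here is a minimal sketch of that step, assuming the key taken from the FASTA header is everything between `>` and the first space (`AGY29650.2|NA`); the helper name `normalize_id` is hypothetical.

```perl
use strict;
use warnings;

# Convert an ID like "AGY29650_2_NA" (file one) into "AGY29650.2|NA",
# the form assumed to be used as the key for file two.
sub normalize_id {
    my ($id) = @_;
    (my $fasta_id = $id) =~ s/^([^_]+)_([0-9]+)_([A-Za-z0-9]+)$/$1.$2|$3/;
    return $fasta_id;   # unchanged if the ID does not have the expected shape
}

print normalize_id("AGY29650_2_NA"), "\n";   # prints AGY29650.2|NA
```

The three capture groups are the accession, the version number, and the trailing tag; the replacement joins them with `.` and `|`.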

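Here is an example: in file one, the fourth line has a positive value (>= 0.5), and the position in its fourth column is 23. The script then looks up position 23 in the sequence from the second file; here, position 23 is the letter T: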

MTYSVFPLMCLLTFIGANAKIV T LP

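Once the script has found that letter, it takes the 2 letters to the right and the 2 letters to the left of the position of interest: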

IVTLP
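The window extraction can be checked in isolation. This is a sketch under the assumption that positions in file one are 1-based; the helper `window_around` is hypothetical and clips the window at the ends of the sequence.

```perl
use strict;
use warnings;

# Return the residue at a 1-based position plus $flank residues on each side.
# Near the ends of the sequence the window is clipped rather than wrapped.
sub window_around {
    my ($seq, $position, $flank) = @_;
    return '' if $position < 1 || $position > length $seq;
    my $start = $position - 1 - $flank;   # 0-based start of the window
    my $end   = $position - 1 + $flank;   # 0-based end (inclusive)
    $start = 0 if $start < 0;
    $end   = length($seq) - 1 if $end > length($seq) - 1;
    return substr($seq, $start, $end - $start + 1);
}

my $seq = "MTYSVFPLMCLLTFIGANAKIVTLPGNDA";   # start of the example sequence
print window_around($seq, 23, 2), "\n";      # prints IVTLP
```

The clipping matters because Perl's `substr` treats a negative offset as counting from the end of the string, so a plain `$position - 3` offset would return the wrong residues for positions 1 and 2.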

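In my previous post, some people on Stack helped me solve a problem caused by the IDs differing between the two files (AGY29650_2_NA in file one versus AGY29650.2 in file two). Now I'm asking for help getting the output I need to finish the script. The script is incomplete because I couldn't find a way to print the output of interest: in this case, 5 letters from the second file (the letter at the position given in the first file, plus 2 letters to its right and 2 letters to its left). I have thousands of files like files one and two, and I need some help finishing the script, based on whatever approach you can recommend. Here is the script: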

use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my @array_residues = (); 
my $seqio_obj = Bio::SeqIO->new(-file => $in,
                             -format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
  my $dd =  $seq_obj->id;
  my $ss =  $seq_obj->seq;
  ###my $ee =  $seq_obj->desc;
  $fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Keep rows whose sixth column (index 5) is >= 0.5

# Open file
open (F, $file) or die; ### open the file or stop the analysis
while(my $one = <F>) {### readline => F
    $one =~ s/\n//g;
    $one =~ s/\r//g;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar (@cols) == 7); ### the line must have 7 columns to add to the array
    my $val = $cols[5];

    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 6);

        } 
    }
}

close F;

I would like to add a push call to collect the new data and then print it to a new file.

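My expected output is the letter at each position with a positive value (>= 0.5), in this case T (position 23), together with the 2 letters to its right and the 2 letters to its left. Using the example data on GitHub (linked above), the expected output is: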

IVTLP

Any suggestions or help are welcome.

Thanks!

Answer 1

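The main problem seems to be that each line has 8 columns, not the 7 the script assumes. A second, minor problem is that the extracted substring should be 5 characters long, not the 6 the script uses. Here is a modified version of the loop that works for me: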

open (F, $file) or die; ### open the file or stop the analysis
while(my $one = <F>) {### readline => F
    chomp $one;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
    my $val = $cols[5];
    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 5);
            print $subresidues, "\n";
        }
    }
}
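To collect the results with push and write them to a new file instead of printing to the terminal, something like the following might work. It is a sketch, not a tested drop-in for the full script: `%fastadata` and `@hits` are filled with hypothetical stand-in data here, and the output name `residues.txt` is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for data the full script would build:
# %fastadata maps a FASTA ID to its sequence, @hits holds [id, position] pairs.
my %fastadata = ('AGY29650.2|NA' => 'MTYSVFPLMCLLTFIGANAKIVTLPGNDA');
my @hits      = (['AGY29650.2|NA', 23]);

# Collect one 5-letter window per hit.
my @array_residues;
for my $hit (@hits) {
    my ($id, $position) = @$hit;
    next unless exists $fastadata{$id};
    push @array_residues, substr($fastadata{$id}, $position - 3, 5);
}

# Write the collected windows to a new file, one per line.
my $outfile = 'residues.txt';   # hypothetical output name
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";
print {$out} "$_\n" for @array_residues;
close($out) or die "Cannot close $outfile: $!";
```

In the real script, the push would go where the loop currently prints `$subresidues`, and the file would be written once after the loop ends.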


perl