perl

Question

I have an array in which each element comes from a line of tab-separated values.

Initial code:

#!/usr/bin/perl -w
use strict;

The code below is a fragment of the script.

sub parser_domains {

my @params = @_;

my $interpro_line = "";
my @interpro_vector = ( );
my $idr_sub_id = $params[0];
my $idr_sub_start = $params[1]+1;
my $idr_sub_end = $params[2]+1;
my $interpro_id = "";
my $interpro_start_location = 0;
my $interpro_end_location = 0;
my $interpro_db = "";
my $interpro_length = 0;
my $interpro_db_accession = "";
my $interpro_signature = "";
my $interpro_evalue = "";
my $interpro_vector_size = 0;
my $interpro_sub_file= "";
my $idr_sub_lenght = ($idr_sub_end-$idr_sub_start)+1;

$interpro_sub_file = "$result_directory_predictor/"."$idr_sub_id/"."$idr_sub_id".".fsa.tsv";

#open file; if it fails, print a error and exits.
unless( open(TSV_FILE_DATA, $interpro_sub_file) ) {
        print "Cannot open file \"$interpro_sub_file\"\n\n";
        return;
}
my @interpro_file_line = <TSV_FILE_DATA>;
close TSV_FILE_DATA;

foreach $interpro_line (@interpro_file_line) {
    @interpro_vector = split('\t',$interpro_line);
    $interpro_id = $interpro_vector[0];
    $interpro_db = $interpro_vector[3];
    $interpro_db_accession = $interpro_vector[4];
    $interpro_start_location = $interpro_vector[6];
    $interpro_end_location = $interpro_vector[7];
    $interpro_signature = $interpro_vector[11];
    $interpro_length = ($interpro_end_location-$interpro_start_location) + 1;

    if ($interpro_signature eq ""){

            $interpro_signature = "NOPIR";
            printf IDP_REGION_FILE "\nFound a $interpro_db domain with no IPR: starts at $interpro_start_location and ends at $interpro_end_location\n";
            printf IDP_REGION_FILE "The size of $interpro_db domain in the sequence is $interpro_length\n";
            printf IDP_REGION_FILE "The IDR starts at $idr_sub_start and and ends at $idr_sub_end\n";
            printf IDP_REGION_FILE "The size of IDR is $idr_sub_lenght\n";
            domains_regions($idr_sub_start,$idr_sub_end,$interpro_start_location,$interpro_end_location,$interpro_signature,$interpro_length,$interpro_db,$idr_sub_id,$interpro_db_accession,$idr_sub_lenght);
    }
    else{
        for $entry_line (@entry_file_line) {
            @entry_vector =  split('\t',$entry_line);
            $entry_ac = $entry_vector[0];
            $entry_type = $entry_vector[1];
            $entry_name = $entry_vector[2];
            chomp($entry_name);

            if ($interpro_signature eq $entry_ac) {
                printf IDP_REGION_FILE "\nFound a $interpro_db domain with Interpro Signature $entry_ac: starts at $interpro_start_location and ends at $interpro_end_location\n";
                printf IDP_REGION_FILE "The size of $interpro_db domain in the sequence is $interpro_length\n";
                printf IDP_REGION_FILE "The Interpro Signature $entry_ac belongs to type $entry_type\n";
                printf IDP_REGION_FILE "The name of $entry_ac is $entry_name\n";
                printf IDP_REGION_FILE "The IDR starts at $idr_sub_start and ends at $idr_sub_end\n";
                printf IDP_REGION_FILE "The size of IDR is $idr_sub_lenght\n";  

                domains_regions($idr_sub_start,$idr_sub_end,$interpro_start_location,$interpro_end_location,$interpro_signature,$interpro_length,$interpro_db,$idr_sub_id,$interpro_db_accession,$idr_sub_lenght);
            }
        }
    }
}
}

Example of the tsv file (interproscan):

P51587  14086411a2cdf1c4cba63020e1622579    3418    Pfam    PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1    2670    2799    7.9E-43 T   15-03-2013
P51587  14086411a2cdf1c4cba63020e1622579    3418    ProSiteProfiles PS50138 BRCA2 repeat profile.   1002    1036    0.0 T   18-03-2013  IPR002093   BRCA2 repeat    GO:0005515|GO:0006302
P51587  14086411a2cdf1c4cba63020e1622579    3418    Gene3D  G3DSA:2.40.50.140       2966    3051    3.1E-52 T   15-03-2013
...

The script runs perfectly, but the comparison $interpro_signature eq "" produces a warning.

Use of uninitialized value $interpro_signature in string eq at /home/duca/eclipse-workspace/idps/idp_parser_interpro.pl line 666.

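So I searched for, and tried, ways of replacing the empty value in the array before the comparison. I want "NOIPR" in place of the empty value. I am working with 9 complete genomes and have more than 324000 proteins to parse.

How can I replace an empty value in the array?

Thanks.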

Answer 1

Your array might not have 12 elements (or the 12th element might be undef).

my $interpro_signature = $interpro_vector[11] // 'some_default_value';

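// is the defined-or operator.

The message Use of uninitialized value means that a variable has not been initialized, or has been set to undef.

See perldiag and consult it regularly. When you get an error or a warning, run the code with perl -Mdiagnostics ... to see the longer explanation.

use warnings; is actually better than -w.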
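As a minimal sketch of the points above, the snippet below splits a record that has only 11 tab-separated fields and falls back to 'NOIPR' for the missing 12th one. The record is made up to resemble the first sample line of the question, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical InterProScan-style record with 11 fields,
# i.e. no InterPro accession in column 12 (index 11).
my $line = join "\t", 'P51587', '14086411a2cdf1c4cba63020e1622579', 3418,
    'Pfam', 'PF09103', 'BRCA2 OB domain 1', 2670, 2799, '7.9E-43', 'T',
    '15-03-2013';

my @fields = split /\t/, $line;

# $fields[11] does not exist, so it evaluates to undef;
# the defined-or operator then supplies the default.
my $signature = $fields[11] // 'NOIPR';

print scalar(@fields), " fields, signature: $signature\n";
```

Because the default is applied before any comparison, a later test such as $signature eq 'NOIPR' no longer touches an undefined value.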

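Update, following substantial edits to the question

Judging by the data shown, other fields may also be missing from the file, so give all variables default values, just like the array element at index 11 above. That is what you generally want to do anyway. For example, if all fields are present in the file but some of them may be empty (nothing between two tabs)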
```
my @interpro_defaults = ('id_default', 'db_default', ...);

my ($interpro_id, $interpro_db, ...) = 
    map { 
        $interpro_vector[$_] // $interpro_defaults[$_] 
    } 0 .. $#interpro_defaults;
```
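This relies on the order of the variables in the list, which is error-prone; see below. Note also that // replaces only undef: a field that is present but empty (nothing between two tabs) is read as an empty string, which is defined, so it needs a length test instead. If some fields are missing altogether, there may be (far) more work to do.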
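The difference between an empty field and a missing one can be sketched as follows. The helper with_defaults, the three-column layout, and the default values are assumptions made for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical defaults, one per column of a three-column record.
my @defaults = ('id_default', 'db_default', 'NOIPR');

# Return one value per default: the field itself when it is defined
# and non-empty, otherwise the default for that position.
sub with_defaults {
    my ($fields, $defaults) = @_;
    return map {
        my $value = $fields->[$_];
        (defined($value) && length($value)) ? $value : $defaults->[$_];
    } 0 .. $#$defaults;
}

# Middle field present but empty: split returns '' for it.
my @row   = with_defaults([ split /\t/, "P51587\t\tIPR002093" ], \@defaults);

# Third field missing altogether: index 2 is undef.
my @short = with_defaults([ split /\t/, "P51587\tPfam" ], \@defaults);

print "@row\n";
print "@short\n";
```

Testing length rather than truth keeps a legitimate value such as 0 from being replaced by its default.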
There are too many separate variables, all related and all named $interpro_X (and then there are $idr_Y and $entry_Z, but fewer of those, so perhaps manageable).

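Can they not be bundled into a container-type variable or a data structure?

A hash %interpro seems suitable, with keys X (so $interpro{id}, etc). Then you can work with them more easily and can perform some operations on the whole lot. You still need to mind the order when initializing, since the fields are read in sequence, but this should be clearer. For example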

my @interpro_vars   = qw(id db db_accession ...);
my @interpro_vector = qw(id_default db_default ...);
my %interpro;
@interpro{@interpro_vars} = @interpro_vector;
# or simply
@interpro{qw(id db ...)} = qw(id_default db_default ...);

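I defined the arrays with the keys and the values first and then used them, in case you want to have those lists in arrays later. If that is not the case, you can initialize the hash with lists directly (the last line).

Here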

my %h; 
@h{LIST-keys} = LIST-values;

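is a way to assign the list LIST-values to the set of keys of the hash %h given in LIST-keys. They are assigned one-for-one, in the order given in the two lists (whose sizes had better match). The @ sigil appears in front of the hash keys because what we have there is a list (of keys), not a hash. Note that the hash must already have been declared somewhere. See slices in perldata.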
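A short sketch of how a hash slice could carry the fields from the question. The key names, the defaults, and the sample record are assumptions for illustration; the column indices (0, 3, 4, 6, 7, 11) are the ones the question's code reads.

```perl
use strict;
use warnings;

# Hypothetical key names, paired with the column indices used in the question.
my @keys    = qw(id db db_accession start end signature);
my @columns = (0, 3, 4, 6, 7, 11);

# Hypothetical defaults for any field that turns out to be undef.
my %defaults;
@defaults{@keys} = ('', '', '', 0, 0, 'NOIPR');

# Made-up record with 11 fields, so index 11 is absent.
my $line = join "\t", 'P51587', '14086411a2cdf1c4cba63020e1622579', 3418,
    'Pfam', 'PF09103', 'BRCA2 OB domain 1', 2670, 2799, '7.9E-43', 'T',
    '15-03-2013';
my @fields = split /\t/, $line;

my %interpro;
@interpro{@keys} = @fields[@columns];        # array slice feeds the hash slice
$interpro{$_} //= $defaults{$_} for @keys;   # fill in whatever is still undef

print "$interpro{db} $interpro{start}-$interpro{end} $interpro{signature}\n";
```

Keeping @keys and @columns side by side puts the order dependence in one place, which is the part the answer flags as error-prone.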

Answer 2

The problem is that your third line contains only 9 elements. So

@interpro_vector = split('\t',$interpro_line);

this line assigns only 9 elements to @interpro_vector, but you then access $interpro_vector[11] (i.e. the 12th element), which does not exist. You can check whether @interpro_vector contains (at least) 12 elements:

if (@interpro_vector >= 12) {
    ...
}
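The element-count check can be sketched like this. The helper describe_line and the two records are hypothetical, built only to have 12 and 11 columns respectively.

```perl
use strict;
use warnings;

# Column 12 (index 11) holds the InterPro accession in the question's data.
my $min_fields = 12;

# Report the signature when the column exists, otherwise the field count.
sub describe_line {
    my ($line) = @_;
    chomp $line;                       # keep the newline out of the last field
    my @fields = split /\t/, $line;
    return @fields >= $min_fields
        ? "signature: $fields[11]"
        : 'only ' . scalar(@fields) . ' fields, no signature column';
}

print describe_line(join("\t", 1 .. 11, 'IPR002093') . "\n"), "\n";
print describe_line(join("\t", 1 .. 11) . "\n"), "\n";
```

One caveat: without a limit argument, split discards trailing empty fields, so a record whose last columns are empty reports fewer elements than it has tabs. Passing -1 as the third argument (split /\t/, $line, -1) keeps them.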

Or you can use the defined-or operator // to fall back to a default value when $interpro_vector[11] is not defined:

$interpro_signature = $interpro_vector[11] // '';

The line above is equivalent to

if (defined $interpro_vector[11]) {
    $interpro_signature = $interpro_vector[11];
} else {
    $interpro_signature = '';
}

Now

if ($interpro_signature eq "") {
    ...
}

will work without the warning.

perl - How to replace an empty value in an array without using a variable for comparison?

arrays

suppress-warnings

is-empty