Perl script to count positive values doesn't report all positive values

Question

I need some help with a Perl script. For each ID (a "section" of the file), the script tells me how many values are >= 0.5. Here is the script:

use strict;
use warnings;

my $file = "229E_O.csv";
my @filearray = ();
my @array_ids = ();
my $thres = 0.5;

open (F, $file) or die;
while(my $l = <F>) {
    $l =~ s/\n//g;
    $l =~ s/\r//g;
    my @cols = split(/\s+/, $l); # split into columns on one or more whitespace characters
    next unless (scalar (@cols) == 8); ### If there aren´t 8 column, don´t add to array
    push @filearray, $l;
    my $current_id = $cols[0];
    push @array_ids, $current_id;
}

close F;
my @nr_array_ids = uniq(@array_ids);
foreach my $new_id (@nr_array_ids) { ### for each ID not redundant
    my $counter = 0;
    my $total = 0;
    foreach my $new_L (@filearray) { ### for each line in the file
        my @n_cols = split(/\s+/, $new_L);
        my $potential = $n_cols[5];
        my $idd = $n_cols[0];
        if ($new_id eq $idd) {
            ++$total;
        }

        if ( ($new_id eq $idd) and ($potential >= $thres) ) {
            ++$counter;
        }
    }

    print "$new_id\t$counter\t$total\n";
}
sub uniq {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

Here is the input file:

APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        671     671     0.134197        .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        672     672     0.282583        .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        676     676     0.290996        .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        680     680     0.376348        .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        682     682     0.552045        .       .       #POSITIVE
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        688     688     0.315533        .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        696     696     0.111705        .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        700     700     0.20703 .       .
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        701     701     0.284842        .       .

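The input has 8 columns (indexed 0-7), and the value in column 5 is the one that matters to me. When this value is >= 0.5, the row is #POSITIVE. The script used to work correctly when I ran it with a different threshold (>= 0.7). Now that I have changed the threshold, the script does not report the rows whose value is >= 0.5. This is the output: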

APT69890_1_NA   0   197
AFR79257_1_NA   0   198
AGT21345_1_NA   0   200
QJY77970_1_NA   0   199
QJY77962_1_NA   0   200
QEO75985_1_NA   0   199
ARK08620_1_NA   0   202

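As you can see, each ID (for example, APT69890_1_NA) is one "section" of the output. The second column is the number of #POSITIVE rows and the third column is the total number of rows counted for that ID. Here the second column for APT69890_1_NA should be 1, but it is 0. If you want to take a look, here is a complete example of my real data: https://github.com/MauriAndresMU1313/Example_NetOGlyc/tree/main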

Answer 1

This line of code requires every data row to have exactly 8 columns; otherwise the row is skipped:

next unless (scalar (@cols) == 8); ### If there aren´t 8 column, don´t add to array

But in your data file, the one row you want to count (line 253 of the data, value 0.552045) has an extra column, "#POSITIVE", which makes 9 columns in total.

APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        682     682     0.552045        .       .       #POSITIVE

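That row is therefore rejected. Your total of 197 is one fewer than the number of entries for APT69890_1_NA in the data file, which further confirms that this is a data problem.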
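To see the rejection concretely, here is a minimal sketch that splits two rows copied from the sample input in the question the same way the script does. The helper name count_fields is hypothetical and only serves the illustration.

```perl
use strict;
use warnings;

# Two rows copied from the sample input in the question.
my $plain     = "APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        680     680     0.376348        .       .";
my $annotated = "APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        682     682     0.552045        .       .       #POSITIVE";

# count_fields (hypothetical helper) splits a line on runs of whitespace,
# exactly as the script in the question does, and returns the field count.
sub count_fields {
    my ($line) = @_;
    my @cols = split /\s+/, $line;
    return scalar @cols;
}

printf "%d fields: plain row\n",     count_fields($plain);      # prints 8
printf "%d fields: annotated row\n", count_fields($annotated);  # prints 9
```

Only the row with 8 fields passes the `== 8` check, so the annotated row never reaches @filearray.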

Either delete the 9th column or change the condition to allow a 9th column.

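If you want to keep the annotation in your data, then in this case you can simply remove #POSITIVE (column 9) from the row and move it to the line before it. That line will then be ignored because it contains only one column: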

#POSITIVE
APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        682     682     0.552045        .       .

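Alternatively, you can change the condition to accept at least 8 columns. This has the advantage of leaving the data as it is, but it weakens the validity check on the data: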

next unless (scalar (@cols) >= 8); ### If there are less than 8 columns, don´t add to array
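If you would rather keep the check strict, one possible middle ground is to accept exactly 8 columns, plus a 9th column only when it looks like an annotation. The sketch below assumes that annotations always start with '#', as "#POSITIVE" does; the helper name row_is_valid is hypothetical.

```perl
use strict;
use warnings;

# row_is_valid (hypothetical helper) accepts a row with exactly 8 columns,
# or with 9 columns when the 9th starts with '#' (assumed annotation marker).
sub row_is_valid {
    my @cols = @_;
    return 1 if @cols == 8;
    return 1 if @cols == 9 and $cols[8] =~ /^#/;
    return 0;
}

# Rows copied from the sample input in the question.
my @plain     = split /\s+/, "APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        680     680     0.376348        .       .";
my @annotated = split /\s+/, "APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        682     682     0.552045        .       .       #POSITIVE";

print "plain row accepted\n"     if row_is_valid(@plain);
print "annotated row accepted\n" if row_is_valid(@annotated);
print "short row rejected\n"     unless row_is_valid('#POSITIVE');
```

In the original loop this would replace the column check with `next unless row_is_valid(@cols);`.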

Answer 2

Please check whether the following script fits your task.

Note: the script assumes that the data files you provided on GitHub are consistent.

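Define $threshold
Allocate a hash %data to hold the counts
Read the file line by line in a while( <> ) loop
Split the line and take $id and $val
Skip to the next line if $id is empty or undefined
Increment the row counter for this particular $id
Initialize the positive counter to 0 if it is not yet defined
Increment the positive counter if $val >= $threshold
Once all the data has been read, output the results in a for loop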

Note: run as ./script.pl datafile (or as perl script.pl datafile if the file has no shebang line).

use strict;
use warnings;
use feature 'say';

my $threshold = 0.5;
my %data;

while( <> ) {
    my($id,$val) = (split)[0,5];
    next unless $id;
    $data{$id}{count}++;
    $data{$id}{pos} //= 0;
    $data{$id}{pos}++ if $val >= $threshold;
}


say join("\t",$_,$data{$_}->@{qw/pos count/}) for sort keys %data;
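If the data file can contain lines with fewer than six fields (for example a lone "#POSITIVE" line, or a header), $val would be undefined for those lines and the comparison would emit a warning. A hedged variant of the same idea guards against that with looks_like_number from the core Scalar::Util module. The helper name tally and the inline sample lines are assumptions for illustration only.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# tally (hypothetical helper) takes an array ref of lines and a threshold and
# returns a hash ref of { id => { count => N, pos => M } }.
sub tally {
    my ($lines, $threshold) = @_;
    my %data;
    for my $line (@$lines) {
        my ($id, $val) = (split ' ', $line)[0, 5];
        # Skip lines that have no numeric value in column 5.
        next unless defined $val and looks_like_number($val);
        $data{$id}{count}++;
        $data{$id}{pos} //= 0;
        $data{$id}{pos}++ if $val >= $threshold;
    }
    return \%data;
}

# Two rows copied from the question, plus a lone annotation line.
my @lines = (
    "APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        680     680     0.376348        .       .",
    "APT69890_1_NA   netOGlyc-4.0.0.13       CARBOHYD        682     682     0.552045        .       .       #POSITIVE",
    "#POSITIVE",
);

my $result = tally(\@lines, 0.5);
print join("\t", $_, $result->{$_}{pos}, $result->{$_}{count}), "\n"
    for sort keys %$result;
```

Because the value is taken by position rather than by total column count, the extra "#POSITIVE" column does not affect the tally.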
