Script correction for counting positive values in a text file

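Some time ago I asked for help producing a Perl script that counts values in a text file that is divided into sections. The script tells me how many positive values occur in the lines of one section and then, when another section of the text begins, reports the number of positive values again. For example, this is my text file: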

;YP_003858584.1_BtCoVBM48_gp2   25 NKSP   0.1462     (9/9)   ---   
;YP_003858584.1_BtCoVBM48_gp2   66 NLTW   0.7837     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  116 NTTQ   0.7013     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  126 NGTH   0.7112     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  163 NCTY   0.7620     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  173 NIST   0.6556     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  231 NITY   0.7442     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  273 NGTI   0.7109     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  322 NITQ   0.6116     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  334 NITS   0.7296     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  361 NSSA   0.5388     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  462 NPSG   0.4656     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  541 NSTK   0.5883     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  590 NASS   0.5643     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  603 NCTD   0.7117     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  646 NSSY   0.5467     (4/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  665 NVSS   0.7980     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  695 NNTI   0.4537     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  703 NFSI   0.5613     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  787 NFSQ   0.6209     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2 1060 NFTT   0.4540     (6/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1084 NGTH   0.5408     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1120 NNTV   0.5803     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1144 NHTS   0.3828     (8/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1149 NVSL   0.4879     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1159 NASV   0.5021     (3/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1180 NESL   0.5770     (7/9)   +     
;ADK66841.1_NA   25 NKSP   0.1462     (9/9)   ---   
;ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   
;ADK66841.1_NA  116 NTTQ   0.7013     (9/9)   ++    
;ADK66841.1_NA  126 NGTH   0.7112     (9/9)   ++    
;ADK66841.1_NA  163 NCTY   0.7620     (9/9)   +++   
;ADK66841.1_NA  173 NIST   0.6556     (8/9)   +     
;ADK66841.1_NA  231 NITY   0.7442     (9/9)   ++    
;ADK66841.1_NA  273 NGTI   0.7109     (9/9)   ++    
;ADK66841.1_NA  322 NITQ   0.6116     (8/9)   +     
;ADK66841.1_NA  334 NITS   0.7296     (9/9)   ++    
;ADK66841.1_NA  361 NSSA   0.5388     (6/9)   +     
;ADK66841.1_NA  462 NPSG   0.4656     (5/9)   -     
;ADK66841.1_NA  541 NSTK   0.5883     (8/9)   +     
;ADK66841.1_NA  590 NASS   0.5643     (6/9)   +     
;ADK66841.1_NA  603 NCTD   0.7117     (9/9)   ++    
;ADK66841.1_NA  646 NSSY   0.5467     (4/9)   +     
;ADK66841.1_NA  665 NVSS   0.7980     (9/9)   +++   
;ADK66841.1_NA  695 NNTI   0.4537     (5/9)   -     
;ADK66841.1_NA  703 NFSI   0.5613     (9/9)   ++    
;ADK66841.1_NA  787 NFSQ   0.6209     (9/9)   ++    
;ADK66841.1_NA 1060 NFTT   0.4540     (6/9)   -     
;ADK66841.1_NA 1084 NGTH   0.5408     (6/9)   +     
;ADK66841.1_NA 1120 NNTV   0.5803     (6/9)   +     
;ADK66841.1_NA 1144 NHTS   0.3828     (8/9)   -     
;ADK66841.1_NA 1149 NVSL   0.4879     (5/9)   -     
;ADK66841.1_NA 1159 NASV   0.5021     (3/9)   +     
;ADK66841.1_NA 1180 NESL   0.5770     (7/9)   +     

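A value counts as positive only when it is 0.7 or higher (>= 0.7). The text file has two sections: one for YP_003858584.1_BtCoVBM48_gp2 and another for ADK66841.1_NA. When you count the positive values (>= 0.7) in each section, each section has 9 of them. I have many files like this, with hundreds of sections, so I would like an idea for a Perl script that counts these values. This is the script: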

use strict;
use warnings;

my $cnt = {};
while(my $line = <STDIN>) {
    if($. == 1) {
        next;
    }else {
        my @cols = split(m{\s+},$line);
        if(@cols == 6) {
            my $potential = $cols[3];
            my $id = $cols[0];
            $id =~ s{^\;}{};
            if(0.7 >= $potential) {
                $cnt->{$id}++;
            };
        };
    };
};

my @ids_found = sort { $a cmp $b } (keys %$cnt);

for my $id (@ids_found) {
    print "PART $id:\n";
    print "$cnt->{$id} (values 0.7 >=)\n";
};

This runs, but I noticed an error in the output. Output:

$ cat Test00.txt | perl File_for_count_values.pl 
PART ADK66841.1_NA:
18 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
18 (values 0.7 >=)

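The output is not what I want: the count the script reports for each section looks like the sum of the positive values of both sections (9 + 9 = 18). The output should be: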

$ cat Test00.txt | perl File_for_count_values.pl 
PART ADK66841.1_NA:
9 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
9 (values 0.7 >=)

Any idea what has to change in the script to achieve this?

Any comments are welcome.

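Your code counts the values that are less than or equal to 0.7, because 0.7 >= $potential is true whenever $potential is at most 0.7.
If I change: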

        if(0.7 >= $potential) {

to:

        if(0.7 <= $potential) {

then I get a count of 9 for each section. Output:

PART ADK66841.1_NA:
9 (values 0.7 >=)
PART YP_003858584.1_BtCoVBM48_gp2:
9 (values 0.7 >=)
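
To see why the operand order matters, here is a minimal sketch that applies both comparisons to a handful of potentials copied from the posted data. The helper count_if and the five-value sample are illustrative only and are not part of the original script.

```perl
use strict;
use warnings;

# A few potentials copied from the posted data (illustrative subset).
my @potentials = (0.1462, 0.7837, 0.7013, 0.6556, 0.5388);
my $threshold  = 0.7;

# count_if is a hypothetical helper: it counts the values for which
# the supplied test returns true.
sub count_if {
    my ($test, @values) = @_;
    return scalar grep { $test->($_) } @values;
}

# Original condition: true when the potential is AT MOST the threshold.
my $at_most  = count_if(sub { $threshold >= $_[0] }, @potentials);
# Corrected condition: true when the potential is AT LEAST the threshold.
my $at_least = count_if(sub { $threshold <= $_[0] }, @potentials);

print "0.7 >= potential: $at_most\n";    # counts 0.1462, 0.6556, 0.5388
print "0.7 <= potential: $at_least\n";   # counts 0.7837, 0.7013
```

Writing the test as $potential >= 0.7, with the variable on the left, reads the same way as the sentence "potential is at least 0.7" and makes this kind of slip harder to miss.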

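Please check whether the following reworked Perl script is useful.

Note: the original code assumes a header line, based on the statement if($. == 1); see $. (the current input line number) in perlvar.

Some changes were made to improve the readability of the script:

  • The variable $threshold (spelled $treshold in the code) is defined at the top of the script
  • next unless $. > 1 skips the header/first line (next, unless the line counter is greater than one)
  • The line is split not only on spaces but also on ;, which avoids the substitution
  • $id and $potential are populated from the @cols array in a single statement
  • The field indexes are adjusted because the first field, the one before the ;, is empty
  • write with format is used for formatted output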
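
The effect of the new split pattern on the field numbering can be checked with a short sketch. The sample line is copied from the posted data; the variable names are chosen only for this illustration.

```perl
use strict;
use warnings;

my $line = ';ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   ';

# Splitting on runs of ';' and spaces: the leading ';' produces an empty
# first field, while split drops the empty fields at the end of the line.
my @cols = split /[; ]+/, $line;

print scalar(@cols), " fields\n";       # 7 fields, $cols[0] is the empty string

# An array slice fills both variables in one statement.
my ($id, $potential) = @cols[1, 4];
print "$id $potential\n";
```

Because of that empty first field, the id sits at index 1 and the potential at index 4, one position higher than in the original script.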

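Note: see $~, which defines the current format used by write; it is used here to close the table.

This script uses a __DATA__ block with the originally posted data to demonstrate the output.

Change while( <DATA> ) to while( <> ) so that the script accepts input from STDIN or from a file whose name is given as an argument to the script (run as ./script.pl file.dat).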

#!/usr/bin/env perl
#
# vim: ai ts=4 sw=4

use strict;
use warnings;

my($id,$counter);
my $treshold = 0.7;

while( <DATA> ) {
    chomp;
    next unless $. > 1;
    my @cols = split("[; ]+", $_);
    next unless @cols == 7;
    my($id,$potential) = @cols[1,4];
    $counter->{$id}++ if $potential >= $treshold;
}

my @sorted_ids = sort { $a cmp $b } keys %$counter;

for $id (@sorted_ids) {
    write;
}

$~ = "STDOUT_BOTTOM";
write;

exit 0;

format STDOUT_TOP =

Criteria:          potential >= @#.##
$treshold

+-----------------------------+-------+
| Part                        | Count |
+-----------------------------+-------+
.

format STDOUT =
| @<<<<<<<<<<<<<<<<<<<<<<<<<< | @>>>> |
$id,$counter->{$id}
.

format STDOUT_BOTTOM =
+-----------------------------+-------+

.

__DATA__
;YP_003858584.1_BtCoVBM48_gp2   25 NKSP   0.1462     (9/9)   ---   
;YP_003858584.1_BtCoVBM48_gp2   66 NLTW   0.7837     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  116 NTTQ   0.7013     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  126 NGTH   0.7112     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  163 NCTY   0.7620     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  173 NIST   0.6556     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  231 NITY   0.7442     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  273 NGTI   0.7109     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  322 NITQ   0.6116     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  334 NITS   0.7296     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  361 NSSA   0.5388     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  462 NPSG   0.4656     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  541 NSTK   0.5883     (8/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  590 NASS   0.5643     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  603 NCTD   0.7117     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  646 NSSY   0.5467     (4/9)   +     
;YP_003858584.1_BtCoVBM48_gp2  665 NVSS   0.7980     (9/9)   +++   
;YP_003858584.1_BtCoVBM48_gp2  695 NNTI   0.4537     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2  703 NFSI   0.5613     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2  787 NFSQ   0.6209     (9/9)   ++    
;YP_003858584.1_BtCoVBM48_gp2 1060 NFTT   0.4540     (6/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1084 NGTH   0.5408     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1120 NNTV   0.5803     (6/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1144 NHTS   0.3828     (8/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1149 NVSL   0.4879     (5/9)   -     
;YP_003858584.1_BtCoVBM48_gp2 1159 NASV   0.5021     (3/9)   +     
;YP_003858584.1_BtCoVBM48_gp2 1180 NESL   0.5770     (7/9)   +     
;ADK66841.1_NA   25 NKSP   0.1462     (9/9)   ---   
;ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   
;ADK66841.1_NA  116 NTTQ   0.7013     (9/9)   ++    
;ADK66841.1_NA  126 NGTH   0.7112     (9/9)   ++    
;ADK66841.1_NA  163 NCTY   0.7620     (9/9)   +++   
;ADK66841.1_NA  173 NIST   0.6556     (8/9)   +     
;ADK66841.1_NA  231 NITY   0.7442     (9/9)   ++    
;ADK66841.1_NA  273 NGTI   0.7109     (9/9)   ++    
;ADK66841.1_NA  322 NITQ   0.6116     (8/9)   +     
;ADK66841.1_NA  334 NITS   0.7296     (9/9)   ++    
;ADK66841.1_NA  361 NSSA   0.5388     (6/9)   +     
;ADK66841.1_NA  462 NPSG   0.4656     (5/9)   -     
;ADK66841.1_NA  541 NSTK   0.5883     (8/9)   +     
;ADK66841.1_NA  590 NASS   0.5643     (6/9)   +     
;ADK66841.1_NA  603 NCTD   0.7117     (9/9)   ++    
;ADK66841.1_NA  646 NSSY   0.5467     (4/9)   +     
;ADK66841.1_NA  665 NVSS   0.7980     (9/9)   +++   
;ADK66841.1_NA  695 NNTI   0.4537     (5/9)   -     
;ADK66841.1_NA  703 NFSI   0.5613     (9/9)   ++    
;ADK66841.1_NA  787 NFSQ   0.6209     (9/9)   ++    
;ADK66841.1_NA 1060 NFTT   0.4540     (6/9)   -     
;ADK66841.1_NA 1084 NGTH   0.5408     (6/9)   +     
;ADK66841.1_NA 1120 NNTV   0.5803     (6/9)   +     
;ADK66841.1_NA 1144 NHTS   0.3828     (8/9)   -     
;ADK66841.1_NA 1149 NVSL   0.4879     (5/9)   -     
;ADK66841.1_NA 1159 NASV   0.5021     (3/9)   +     
;ADK66841.1_NA 1180 NESL   0.5770     (7/9)   +     

Output


Criteria:          potential >=  0.70

+-----------------------------+-------+
| Part                        | Count |
+-----------------------------+-------+
| ADK66841.1_NA               |     9 |
| YP_003858584.1_BtCoVBM48_gp |     9 |
+-----------------------------+-------+
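
If the counting should not be tied to DATA at all, one option is to move the loop into a sub that takes any filehandle. The following is only a sketch: count_positive is a hypothetical helper, not part of the script above, and instead of skipping line 1 it assumes that a data row is any line that splits into exactly 7 fields with a numeric potential.

```perl
use strict;
use warnings;

# count_positive is a hypothetical helper. It reads from any filehandle,
# so the same code can serve DATA, STDIN or an opened file.
sub count_positive {
    my ($fh, $threshold) = @_;
    my %counter;
    while (my $line = <$fh>) {
        chomp $line;
        my @cols = split /[; ]+/, $line;
        next unless @cols == 7;                        # assumption: data rows have 7 fields
        my ($id, $potential) = @cols[1, 4];
        next unless $potential =~ /^\d+(?:\.\d+)?$/;   # skip rows without a numeric potential
        $counter{$id}++ if $potential >= $threshold;
    }
    return \%counter;
}

# Demonstration on an in-memory filehandle holding three posted rows.
my $sample = <<'END';
;ADK66841.1_NA   25 NKSP   0.1462     (9/9)   ---
;ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++
;ADK66841.1_NA  116 NTTQ   0.7013     (9/9)   ++
END
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $counts = count_positive($fh, 0.7);
close $fh;

print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Calling count_positive(\*STDIN, 0.7) would read standard input instead of the in-memory sample.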

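Note:

The file you referred me to on GitHub does not contain the leading ; that the posted data has. Because of this, the field count is reduced by one, so the script produced no results.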

Please make the following change in the Perl script, replacing:

       next unless @cols == 7;
       my($id,$potential) = @cols[1,4];

with:

       next unless @cols == 6;
       my($id,$potential) = @cols[0,3];
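
As an alternative to editing the indexes by hand, one could accept both layouts by removing an optional leading ; before splitting on whitespace, so that the id is always field 0 and the potential always field 3. This is only a sketch: parse_row is a hypothetical helper, and the second sample line (without ;) is an assumed layout based on the description above, not a line copied from the GitHub file.

```perl
use strict;
use warnings;

# parse_row is a hypothetical helper: it returns (id, potential) for a
# data row, or an empty list for anything that does not have 6 fields.
sub parse_row {
    my ($line) = @_;
    $line =~ s/^\s*;?\s*//;           # drop an optional leading ';'
    my @cols = split ' ', $line;      # split on whitespace, no empty fields
    return unless @cols == 6;
    return @cols[0, 3];
}

my @rows = (
    ';ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   ',   # posted layout
    'ADK66841.1_NA   66 NLTW   0.7837     (9/9)   +++   ',    # assumed layout without ';'
);

for my $row (@rows) {
    my ($id, $potential) = parse_row($row);
    print "$id $potential\n";
}
```

Both rows yield the same id and potential, so the counting code does not need to know which layout a file uses.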