用于计算字符串中三个不同 ID 出现次数的 Perl 脚本,其中出现次数可能为零

Perl script for counting occurrences of three different IDs in string where occurrence may be zero

我在下面附上了一个示例,其中包含我的输入文件中的 5 行(制表符分隔):

G157    G157.2  535 3   344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157    G157.6  535 42  276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157    G157.9  535 4   344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157    G157.11 535 6   276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs

第二列包含转录本 ID,第五列包含读取映射到该转录本的样本 ID,第六列包含所有这些读取的列表。下面对第六列的结构进行解释:

SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes

有三个样本 ID (276,344,&365),但每个转录本可以覆盖任何一个、两个或所有三个样本。

这是我希望输出的样子:

transcript_id   276 344 365
G157.1  0   0   2
G157.2  0   3   0
G157.6  9   18  15
G157.9  0   4   0
G157.11 6   0   0

我已经能够让它工作,但是因为我对 Perl 比较陌生,所以我无法弄清楚如何完全用 Perl 完成这个任务。我在最后使用第二个 R 脚本拼凑了一个矩阵。这是我的 Perl 脚本:

#!/usr/bin/perl -w
use strict;
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
while(<INFILE>) {
    $line=$_;
    chomp $line;
    if ($line=~m/\S+/) {
        my ($transcript) = ($line=~m/\S+\s+(\S+)/);
        my ($transcript_info) = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);
    
       if ($unique_donor_counts !~ m/,/) {
           my $commas = $transcript_info =~ y/,//;
           my $counts = $commas+1;
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);  
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;
               my $counts = $commas+1;
               my ($donor) = ($i=~m/(\d\d\d)/);
               print "$transcript\t$donor\t$counts\n";
           }
       }
    }
}

输出:

G157.1  2   365
G157.2  3   344
G157.6  9   276
G157.6  18  344
G157.6  15  365
G157.9  4   344
G157.11 6   276

我没有打印,而是将它保存到一个输出文件,运行 这个 R 脚本:

library(tidyr)

DF <- read.delim("output", sep = "\t", header = F)
df <- tidyr::pivot_wider(output, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = T, row.names = F, quote = F, dec = ".")

我在 Perl 中遇到的问题是,当该行中没有样本 ID 时,我不知道如何获得 0 的计数。我真的很想在 Perl 上做得更好,所以我想弄清楚如何在一个脚本中完成所有这些而不是使用 R 操作矩阵。感谢您花时间帮助 Perl 新手学习!

编辑 post 从我原来的一些进步

我会用很多 split 来完成它。

#!/usr/bin/env perl
use strict;
use warnings; # Prefer over the -w option
use feature qw/say/;

# Reads from standard input/filenames given on command line, prints to
# standard output.

# All samples encountered in the entire input
my %samples;
# The data for each line of input
my @transcripts;

# For each line of input
while (my $line = <<>>) {
    # Split into fields on tabs.
    my @F = split /\t/, $line;

    # Add to the known samples
    my @s = split /,/, $F[4];
    @samples{@s} = (1) x @s;

    my %counts;
    # Split the 6th column on semicolon and iterate over each element
    for my $read (split /;/, $F[5]) {
        # Extract the sample id and increment its count
        my ($id, @reads) = split /[:,]/, $read;
        $counts{$id} += @reads;
    }

    # Save this line's counts
    push @transcripts, [ $F[0], \%counts ];
}

# Sample ids in sorted order
my @samples = sort { $a <=> $b } keys %samples;

# Header line
say join("\t", "transcript_id", @samples);

foreach my $transcript (@transcripts) {
    # And print the counts of each transcript's sample, replacing undef values with 0.
    say join("\t", $$transcript[0], map { $_ // 0 } @{$$transcript[1]}{@samples});
}

示例:

$ perl count.pl input.tsv
transcript_id   276 344 365
G157.2  0   3   0
G157.6  10  18  14
G157.9  0   4   0
G157.11 6   0   0

它在多个地方使用的一件事是 slice, to set/retrieve multiple elements of a hash in one expression. Also helps to be familiar with references, which are used in more complicated data structures 而不仅仅是普通数组和散列。

我将浏览您的程序文件并在需要的地方发表评论。

#!/usr/bin/perl -w
use strict;

-w 开关全局打开警告。您应该使用警告的词法版本:use warnings。它通常被认为是一种更好的做法,因为它允许您在需要时关闭警告,例如:no warnings 'experimental'.

my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

所有这些变量都未使用。所以删除所有未使用的变量。

你不应该在你的程序的顶部声明你所有的变量并使它们都是全局的。您应该将它们声明为尽可能靠近您将要使用它们的地方,并且在尽可能小的范围内。用 my 声明的变量的范围是最近的封闭块:{ block here }。这将封装您的代码,并使其更易于阅读。

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";

这是旧的,两个参数开放,通常被认为是不好的做法,因为它允许代码插入。您希望打开三个参数,并使用显式打开模式。您还希望在您的 die 语句中包含错误报告 $!

open my $fh, "<", $ARGV[0] or die "Cannot open '$ARGV[0]' for reading: $!";

您也不希望在 die 语句的末尾放置换行符,因为那样会抑制错误中的行号。

while(<INFILE>) {
    $line=$_;
    chomp $line;

您不必跳过所有这些障碍。你可以做任何一个

while (my $line = <$fh>) {    # declared in the smallest possible scope
    chomp $line;

while (<$fh>) {
    chomp;        # chomps $_

后者的好处是不必键入要与正则表达式匹配的变量。例如。而不是 $line =~ m/..../ 只需键入 m/..../。这是一个常用的 Perl 习语。

    if ($line=~m/\S+/) {
        my ($transcript)         = ($line=~m/\S+\s+(\S+)/);
        my ($transcript_info)    = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
        my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);

你可以只匹配一次,同时捕获三个值:

my ($trans,$udc,$trans_info) = $line =~ /^\S+\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)/;

请注意,变量的顺序必须与捕获的值的顺序相同。

如果您使用 split

,这可能会更简单一些
my @vals = split ' ', $line;
my $trans = $vals[1];        # and so on...

这部分

       my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);

错了。你认为它只是匹配第一个冒号前面的数字,例如344:。如果你只有一个冒号,那就是它的作用。否则它会匹配该行中最后一个冒号之前的所有字符,因为 \S+ 是贪婪的并且会尝试尽可能多地匹配。当您稍后在 ; 上拆分 $unique_donor_ID 时,您将错过最后一个冒号之后的所有数据。可以看到结果here。注意输出的第 2 行以 276 开始并以 365 结束。365 是最后一条记录的开始。

       if ($unique_donor_counts !~ m/,/) {            # checks for 311,322, ..
           my $commas = $transcript_info =~ y/,//;    # count by destroying data
           my $counts = $commas+1;                    # unnecessary extra variable
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);    # splitting the wrong variable
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;              
               my $counts = $commas+1;
               my ($donor) = ($i=~m/(\d\d\d)/);       # risky way to get the first 3 numbers
               print "$transcript\t$donor\t$counts\n";
        .....

您的逻辑是检查供体 ID 短列表中的逗号,并分别处理单个值。这很可能是一种有缺陷的方法。为什么不拆分 ; 上的最后一个字段并从那里处理结果?


use strict;
use warnings;
use feature 'say';

my %dat;             # file data
my @donors;          # store donor ids
my @trans;           # store the transcript names in original order
while (<DATA>) {
    my @f = split;                      # split line on whitespace
    my $trans = $f[1];                  # second field is transcript
    push @trans, $trans;                # save trans for later
    my $tinfo = $f[5];                  # 6th field is our payload
    my @data = split /;/, $tinfo;       # split payload
    for (@data) {                       # for each data segment
        my ($donor) = /^(\d+):/;    # ... extract donor id
        my $count = () = /,/g;      # ... count commas
        $count++;                   # ... add 1 comma for whatever reason
        #say join "\t", $trans, $count, $donor;  # optional print
        push @donors, $donor;       # store donor ids
        $dat{$trans}{$donor} = $count;   # this is the data we want
    }
}

sub uniq {         # for deduping
    my %seen;
    grep { !$seen{$_}++ } @_;
}
@donors= sort { $a <=> $b } uniq(@donors);   # dedupe and sort
@trans  = uniq(@trans);                      # just dedupe, preserve original order from file
say join "\t", "transcript_id",  @donors;    # print header
for my $key (@trans) {
    # for each donor id, for each key, print number, or 0 if undefined
    say join "\t", $key, map { $dat{$key}{$_} // 0 } @donors;
}

__DATA__
G157    G157.2  535 3   344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157    G157.6  535 42  276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157    G157.9  535 4   344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157    G157.11 535 6   276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs

注意这两行

my ($donor) = /^(\d+):/;    
my $count = () = /,/g;      

正在使用隐式模式匹配,/,/g$_ =~ /,/g含义相同。在第一种情况下,我们匹配并捕获,并分配给 $donor。在第二种情况下,我们强制使用 = () = 进行匹配计数,这是一种用于更改正则表达式上下文的 Perl 技巧。这与 y/,// 的作用相同,只是它不是破坏性的,并且还可以与适当的正则表达式一起使用而不是单个字符。

输出:

transcript_id   276     344     365
G157.2  0       3       0
G157.6  10      18      14
G157.9  0       4       0
G157.11 6       0       0