用于计算字符串中三个不同 ID 出现次数的 Perl 脚本,其中出现次数可能为零
Perl script for counting occurrences of three different IDs in string where occurrence may be zero
我在下面附上了一个示例,其中包含我的输入文件中的 5 行(制表符分隔):
G157 G157.2 535 3 344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157 G157.6 535 42 276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157 G157.9 535 4 344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157 G157.11 535 6 276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs
第二列包含转录本 ID,第五列包含读取映射到该转录本的样本 ID,第六列包含所有这些读取的列表。下面对第六列的结构进行解释:
SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes
有三个样本 ID (276,344,&365),但每个转录本可以覆盖任何一个、两个或所有三个样本。
这是我希望输出的样子:
transcript_id 276 344 365
G157.1 0 0 2
G157.2 0 3 0
G157.6 9 18 15
G157.9 0 4 0
G157.11 6 0 0
我已经能够让它工作,但是因为我对 Perl 比较陌生,所以我无法弄清楚如何完全用 Perl 完成这个任务。我在最后使用第二个 R 脚本拼凑了一个矩阵。这是我的 Perl 脚本:
#!/usr/bin/perl -w
use strict;
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash,
$inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron,
$countGene, %unique, );
open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
while(<INFILE>) {
$line=$_;
chomp $line;
if ($line=~m/\S+/) {
my ($transcript) = ($line=~m/\S+\s+(\S+)/);
my ($transcript_info) = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);
if ($unique_donor_counts !~ m/,/) {
my $commas = $transcript_info =~ y/,//;
my $counts = $commas+1;
print "$transcript\t$unique_donor_ID\t$counts\n";
} else {
my @spl = split(';', $unique_donor_ID);
foreach my $i (@spl) {
my $commas = $i =~ y/,//;
my $counts = $commas+1;
my ($donor) = ($i=~m/(\d\d\d)/);
print "$transcript\t$donor\t$counts\n";
}
}
}
}
输出:
G157.1 2 365
G157.2 3 344
G157.6 9 276
G157.6 18 344
G157.6 15 365
G157.9 4 344
G157.11 6 276
我没有打印,而是将它保存到一个输出文件,运行 这个 R 脚本:
library(tidyr)
DF <- read.delim("output", sep = "\t", header = F)
df <- tidyr::pivot_wider(output, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = T, row.names = F, quote = F, dec = ".")
我在 Perl 中遇到的问题是,当该行中没有样本 ID 时,我不知道如何获得 0 的计数。我真的很想在 Perl 上做得更好,所以我想弄清楚如何在一个脚本中完成所有这些而不是使用 R 操作矩阵。感谢您花时间帮助 Perl 新手学习!
编辑 post 从我原来的一些进步
我会用很多 split
来完成它。
#!/usr/bin/env perl
use strict;
use warnings; # Prefer over the -w option
use feature qw/say/;
# Reads from standard input/filenames given on command line, prints to
# standard output.
# All samples encountered in the entire input
my %samples;
# The data for each line of input
my @transcripts;
# For each line of input
while (my $line = <<>>) {
# Split into fields on tabs.
my @F = split /\t/, $line;
# Add to the known samples
my @s = split /,/, $F[4];
@samples{@s} = (1) x @s;
my %counts;
# Split the 6th column on semicolon and iterate over each element
for my $read (split /;/, $F[5]) {
# Extract the sample id and increment its count
my ($id, @reads) = split /[:,]/, $read;
$counts{$id} += @reads;
}
# Save this line's counts
push @transcripts, [ $F[0], \%counts ];
}
# Sample ids in sorted order
my @samples = sort { $a <=> $b } keys %samples;
# Header line
say join("\t", "transcript_id", @samples);
foreach my $transcript (@transcripts) {
# And print the counts of each transcript's sample, replacing undef values with 0.
say join("\t", $$transcript[0], map { $_ // 0 } @{$$transcript[1]}{@samples});
}
示例:
$ perl count.pl input.tsv
transcript_id 276 344 365
G157.2 0 3 0
G157.6 10 18 14
G157.9 0 4 0
G157.11 6 0 0
它在多个地方使用的一件事是 slice, to set/retrieve multiple elements of a hash in one expression. Also helps to be familiar with references, which are used in more complicated data structures 而不仅仅是普通数组和散列。
我将浏览您的程序文件并在需要的地方发表评论。
#!/usr/bin/perl -w
use strict;
-w
开关全局打开警告。您应该使用警告的词法版本:use warnings
。它通常被认为是一种更好的做法,因为它允许您在需要时关闭警告,例如:no warnings 'experimental'
.
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash,
$inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron,
$countGene, %unique, );
所有这些变量都未使用。所以删除所有未使用的变量。
你不应该在你的程序的顶部声明你所有的变量并使它们都是全局的。您应该将它们声明为尽可能靠近您将要使用它们的地方,并且在尽可能小的范围内。用 my
声明的变量的范围是最近的封闭块:{ block here }
。这将封装您的代码,并使其更易于阅读。
open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
这是旧的,两个参数开放,通常被认为是不好的做法,因为它允许代码插入。您希望打开三个参数,并使用显式打开模式。您还希望在您的 die 语句中包含错误报告 $!
:
open my $fh, "<", $ARGV[0] or die "Cannot open '$ARGV[0]' for reading: $!";
您也不希望在 die
语句的末尾放置换行符,因为那样会抑制错误中的行号。
while(<INFILE>) {
$line=$_;
chomp $line;
您不必跳过所有这些障碍。你可以做任何一个
while (my $line = <$fh>) { # declared in the smallest possible scope
chomp $line;
或
while (<$fh>) {
chomp; # chomps $_
后者的好处是不必键入要与正则表达式匹配的变量。例如。而不是 $line =~ m/..../
只需键入 m/..../
。这是一个常用的 Perl 习语。
if ($line=~m/\S+/) {
my ($transcript) = ($line=~m/\S+\s+(\S+)/);
my ($transcript_info) = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
你可以只匹配一次,同时捕获三个值:
my ($trans,$udc,$trans_info) = $line =~ /^\S+\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)/;
请注意,变量的顺序必须与捕获的值的顺序相同。
如果您使用 split
,这可能会更简单一些
my @vals = split ' ', $line;
my $trans = $vals[1]; # and so on...
这部分
my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);
错了。你认为它只是匹配第一个冒号前面的数字,例如344:
。如果你只有一个冒号,那就是它的作用。否则它会匹配该行中最后一个冒号之前的所有字符,因为 \S+
是贪婪的并且会尝试尽可能多地匹配。当您稍后在 ;
上拆分 $unique_donor_ID
时,您将错过最后一个冒号之后的所有数据。可以看到结果here。注意输出的第 2 行以 276 开始并以 365 结束。365 是最后一条记录的开始。
if ($unique_donor_counts !~ m/,/) { # checks for 311,322, ..
my $commas = $transcript_info =~ y/,//; # count by destroying data
my $counts = $commas+1; # unnecessary extra variable
print "$transcript\t$unique_donor_ID\t$counts\n";
} else {
my @spl = split(';', $unique_donor_ID); # splitting the wrong variable
foreach my $i (@spl) {
my $commas = $i =~ y/,//;
my $counts = $commas+1;
my ($donor) = ($i=~m/(\d\d\d)/); # risky way to get the first 3 numbers
print "$transcript\t$donor\t$counts\n";
.....
您的逻辑是检查供体 ID 短列表中的逗号,并分别处理单个值。这很可能是一种有缺陷的方法。为什么不拆分 ;
上的最后一个字段并从那里处理结果?
use strict;
use warnings;
use feature 'say';
my %dat; # file data
my @donors; # store donor ids
my @trans; # store the transcript names in original order
while (<DATA>) {
my @f = split; # split line on whitespace
my $trans = $f[1]; # second field is transcript
push @trans, $trans; # save trans for later
my $tinfo = $f[5]; # 6th field is our payload
my @data = split /;/, $tinfo; # split payload
for (@data) { # for each data segment
my ($donor) = /^(\d+):/; # ... extract donor id
my $count = () = /,/g; # ... count commas
$count++; # ... add 1 comma for whatever reason
#say join "\t", $trans, $count, $donor; # optional print
push @donors, $donor; # store donor ids
$dat{$trans}{$donor} = $count; # this is the data we want
}
}
sub uniq { # for deduping
my %seen;
grep { !$seen{$_}++ } @_;
}
@donors= sort { $a <=> $b } uniq(@donors); # dedupe and sort
@trans = uniq(@trans); # just dedupe, preserve original order from file
say join "\t", "transcript_id", @donors; # print header
for my $key (@trans) {
# for each donor id, for each key, print number, or 0 if undefined
say join "\t", $key, map { $dat{$key}{$_} // 0 } @donors;
}
__DATA__
G157 G157.2 535 3 344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157 G157.6 535 42 276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157 G157.9 535 4 344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157 G157.11 535 6 276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs
注意这两行
my ($donor) = /^(\d+):/;
my $count = () = /,/g;
正在使用隐式模式匹配,/,/g
与$_ =~ /,/g
含义相同。在第一种情况下,我们匹配并捕获,并分配给 $donor
。在第二种情况下,我们强制使用 = () =
进行匹配计数,这是一种用于更改正则表达式上下文的 Perl 技巧。这与 y/,//
的作用相同,只是它不是破坏性的,并且还可以与适当的正则表达式一起使用而不是单个字符。
输出:
transcript_id 276 344 365
G157.2 0 3 0
G157.6 10 18 14
G157.9 0 4 0
G157.11 6 0 0
我在下面附上了一个示例,其中包含我的输入文件中的 5 行(制表符分隔):
G157 G157.2 535 3 344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157 G157.6 535 42 276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157 G157.9 535 4 344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157 G157.11 535 6 276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs
第二列包含转录本 ID,第五列包含读取映射到该转录本的样本 ID,第六列包含所有这些读取的列表。下面对第六列的结构进行解释:
SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes
有三个样本 ID (276,344,&365),但每个转录本可以覆盖任何一个、两个或所有三个样本。
这是我希望输出的样子:
transcript_id 276 344 365
G157.1 0 0 2
G157.2 0 3 0
G157.6 9 18 15
G157.9 0 4 0
G157.11 6 0 0
我已经能够让它工作,但是因为我对 Perl 比较陌生,所以我无法弄清楚如何完全用 Perl 完成这个任务。我在最后使用第二个 R 脚本拼凑了一个矩阵。这是我的 Perl 脚本:
#!/usr/bin/perl -w
use strict;
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash,
$inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron,
$countGene, %unique, );
open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
while(<INFILE>) {
$line=$_;
chomp $line;
if ($line=~m/\S+/) {
my ($transcript) = ($line=~m/\S+\s+(\S+)/);
my ($transcript_info) = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);
if ($unique_donor_counts !~ m/,/) {
my $commas = $transcript_info =~ y/,//;
my $counts = $commas+1;
print "$transcript\t$unique_donor_ID\t$counts\n";
} else {
my @spl = split(';', $unique_donor_ID);
foreach my $i (@spl) {
my $commas = $i =~ y/,//;
my $counts = $commas+1;
my ($donor) = ($i=~m/(\d\d\d)/);
print "$transcript\t$donor\t$counts\n";
}
}
}
}
输出:
G157.1 2 365
G157.2 3 344
G157.6 9 276
G157.6 18 344
G157.6 15 365
G157.9 4 344
G157.11 6 276
我没有打印,而是将它保存到一个输出文件,运行 这个 R 脚本:
library(tidyr)
DF <- read.delim("output", sep = "\t", header = F)
df <- tidyr::pivot_wider(output, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = T, row.names = F, quote = F, dec = ".")
我在 Perl 中遇到的问题是,当该行中没有样本 ID 时,我不知道如何获得 0 的计数。我真的很想在 Perl 上做得更好,所以我想弄清楚如何在一个脚本中完成所有这些而不是使用 R 操作矩阵。感谢您花时间帮助 Perl 新手学习!
编辑 post 从我原来的一些进步
我会用很多 split
来完成它。
#!/usr/bin/env perl
use strict;
use warnings; # Prefer over the -w option
use feature qw/say/;
# Reads from standard input/filenames given on command line, prints to
# standard output.
# All samples encountered in the entire input
my %samples;
# The data for each line of input
my @transcripts;
# For each line of input
while (my $line = <<>>) {
# Split into fields on tabs.
my @F = split /\t/, $line;
# Add to the known samples
my @s = split /,/, $F[4];
@samples{@s} = (1) x @s;
my %counts;
# Split the 6th column on semicolon and iterate over each element
for my $read (split /;/, $F[5]) {
# Extract the sample id and increment its count
my ($id, @reads) = split /[:,]/, $read;
$counts{$id} += @reads;
}
# Save this line's counts
push @transcripts, [ $F[0], \%counts ];
}
# Sample ids in sorted order
my @samples = sort { $a <=> $b } keys %samples;
# Header line
say join("\t", "transcript_id", @samples);
foreach my $transcript (@transcripts) {
# And print the counts of each transcript's sample, replacing undef values with 0.
say join("\t", $$transcript[0], map { $_ // 0 } @{$$transcript[1]}{@samples});
}
示例:
$ perl count.pl input.tsv
transcript_id 276 344 365
G157.2 0 3 0
G157.6 10 18 14
G157.9 0 4 0
G157.11 6 0 0
它在多个地方使用的一件事是 slice, to set/retrieve multiple elements of a hash in one expression. Also helps to be familiar with references, which are used in more complicated data structures 而不仅仅是普通数组和散列。
我将浏览您的程序文件并在需要的地方发表评论。
#!/usr/bin/perl -w
use strict;
-w
开关全局打开警告。您应该使用警告的词法版本:use warnings
。它通常被认为是一种更好的做法,因为它允许您在需要时关闭警告,例如:no warnings 'experimental'
.
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash,
$inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron,
$countGene, %unique, );
所有这些变量都未使用。所以删除所有未使用的变量。
你不应该在你的程序的顶部声明你所有的变量并使它们都是全局的。您应该将它们声明为尽可能靠近您将要使用它们的地方,并且在尽可能小的范围内。用 my
声明的变量的范围是最近的封闭块:{ block here }
。这将封装您的代码,并使其更易于阅读。
open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
这是旧的,两个参数开放,通常被认为是不好的做法,因为它允许代码插入。您希望打开三个参数,并使用显式打开模式。您还希望在您的 die 语句中包含错误报告 $!
:
open my $fh, "<", $ARGV[0] or die "Cannot open '$ARGV[0]' for reading: $!";
您也不希望在 die
语句的末尾放置换行符,因为那样会抑制错误中的行号。
while(<INFILE>) {
$line=$_;
chomp $line;
您不必跳过所有这些障碍。你可以做任何一个
while (my $line = <$fh>) { # declared in the smallest possible scope
chomp $line;
或
while (<$fh>) {
chomp; # chomps $_
后者的好处是不必键入要与正则表达式匹配的变量。例如。而不是 $line =~ m/..../
只需键入 m/..../
。这是一个常用的 Perl 习语。
if ($line=~m/\S+/) {
my ($transcript) = ($line=~m/\S+\s+(\S+)/);
my ($transcript_info) = ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
my ($unique_donor_counts)= ($line=~m/\S+\s+\S+\s+\S+\s+\S+\s+(\S+)/);
你可以只匹配一次,同时捕获三个值:
my ($trans,$udc,$trans_info) = $line =~ /^\S+\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)/;
请注意,变量的顺序必须与捕获的值的顺序相同。
如果您使用 split
my @vals = split ' ', $line;
my $trans = $vals[1]; # and so on...
这部分
my ($unique_donor_ID) = ($transcript_info =~m/(\S+)\:/);
错了。你认为它只是匹配第一个冒号前面的数字,例如344:
。如果你只有一个冒号,那就是它的作用。否则它会匹配该行中最后一个冒号之前的所有字符,因为 \S+
是贪婪的并且会尝试尽可能多地匹配。当您稍后在 ;
上拆分 $unique_donor_ID
时,您将错过最后一个冒号之后的所有数据。可以看到结果here。注意输出的第 2 行以 276 开始并以 365 结束。365 是最后一条记录的开始。
if ($unique_donor_counts !~ m/,/) { # checks for 311,322, ..
my $commas = $transcript_info =~ y/,//; # count by destroying data
my $counts = $commas+1; # unnecessary extra variable
print "$transcript\t$unique_donor_ID\t$counts\n";
} else {
my @spl = split(';', $unique_donor_ID); # splitting the wrong variable
foreach my $i (@spl) {
my $commas = $i =~ y/,//;
my $counts = $commas+1;
my ($donor) = ($i=~m/(\d\d\d)/); # risky way to get the first 3 numbers
print "$transcript\t$donor\t$counts\n";
.....
您的逻辑是检查供体 ID 短列表中的逗号,并分别处理单个值。这很可能是一种有缺陷的方法。为什么不拆分 ;
上的最后一个字段并从那里处理结果?
use strict;
use warnings;
use feature 'say';
my %dat; # file data
my @donors; # store donor ids
my @trans; # store the transcript names in original order
while (<DATA>) {
my @f = split; # split line on whitespace
my $trans = $f[1]; # second field is transcript
push @trans, $trans; # save trans for later
my $tinfo = $f[5]; # 6th field is our payload
my @data = split /;/, $tinfo; # split payload
for (@data) { # for each data segment
my ($donor) = /^(\d+):/; # ... extract donor id
my $count = () = /,/g; # ... count commas
$count++; # ... add 1 comma for whatever reason
#say join "\t", $trans, $count, $donor; # optional print
push @donors, $donor; # store donor ids
$dat{$trans}{$donor} = $count; # this is the data we want
}
}
sub uniq { # for deduping
my %seen;
grep { !$seen{$_}++ } @_;
}
@donors= sort { $a <=> $b } uniq(@donors); # dedupe and sort
@trans = uniq(@trans); # just dedupe, preserve original order from file
say join "\t", "transcript_id", @donors; # print header
for my $key (@trans) {
# for each donor id, for each key, print number, or 0 if undefined
say join "\t", $key, map { $dat{$key}{$_} // 0 } @donors;
}
__DATA__
G157 G157.2 535 3 344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157 G157.6 535 42 276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157 G157.9 535 4 344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157 G157.11 535 6 276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs
注意这两行
my ($donor) = /^(\d+):/;
my $count = () = /,/g;
正在使用隐式模式匹配,/,/g
与$_ =~ /,/g
含义相同。在第一种情况下,我们匹配并捕获,并分配给 $donor
。在第二种情况下,我们强制使用 = () =
进行匹配计数,这是一种用于更改正则表达式上下文的 Perl 技巧。这与 y/,//
的作用相同,只是它不是破坏性的,并且还可以与适当的正则表达式一起使用而不是单个字符。
输出:
transcript_id 276 344 365
G157.2 0 3 0
G157.6 10 18 14
G157.9 0 4 0
G157.11 6 0 0