循环遍历 perl 中的列的问题

Question

我编写的脚本有问题。我已将其分解以确定问题所在。

输入文件（制表符分隔）：

FORMAT  Sample1        Sample2        Sample3
GT:AD:DP:GQ:PL  0/1:17,6:23:85:85,0,370 0/0:51,6:57:17:0,17,1359        0/0:3,0:3:9:0,9,99
GT:AD:DP:GQ:PGT:PID:PL  0/0:3,0:3:0:.:.:0,0,38  0/0:1,0:1:3:.:.:0,3,33  0/1:1,2:3:26:0|1:13813_T_G:81,0,26
GT:AD:DP:GQ:PGT:PID:PL  ./.:2,0:2:.:.:.:0,0,0   0/0:1,0:1:3:.:.:0,3,33  0/1:1,2:3:26:0|1:13813_T_G:81,0,26
GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0       1/1:0,4:4:12:131,12,0   ./.:0,0:0:.:0,0,0
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,2:2:6:1|1:14590_G_A:90,6,0        0/0:3,0:3:9:.:.:0,9,98  0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,2:2:6:1|1:14590_G_A:90,6,0        0/0:3,0:3:9:.:.:0,9,98  0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,2:2:6:1|1:14590_G_A:90,6,0        0/0:3,0:3:9:.:.:0,9,98  0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,2:2:6:1|1:14590_G_A:90,6,0        0/0:2,0:2:6:.:.:0,6,72  0/0:1,0:1:3:.:.:0,3,30
GT:AD:DP:GQ:PL  1/1:0,7:7:21:186,21,0   0/1:5,4:9:79:79,0,103   ./.:1,0:1:.:0,0,0

每行所需的输出（从每个样本的冒号前取前 3 个字符）并打印每行：

GT:AD:DP:GQ:PL  0/1 0/0       0/0
GT:AD:DP:GQ:PGT:PID:PL  0/0  0/0  0/1
GT:AD:DP:GQ:PGT:PID:PL  ./.   0/0  0/1
GT:AD:DP:GQ:PL  ./.       1/1   ./.
GT:AD:DP:GQ:PGT:PID:PL  1/1        0/0  0/0
GT:AD:DP:GQ:PGT:PID:PL  1/1       0/0  0/0
GT:AD:DP:GQ:PGT:PID:PL  1/1       0/0  0/0
GT:AD:DP:GQ:PGT:PID:PL  1/1      0/0  0/0
GT:AD:DP:GQ:PL  1/1   0/1  ./.

我用来执行此步骤的代码没有按每行预期的那样生成正确的 0/0、0/1、0/2 代码。我认为这是我如何编写 for 循环的问题，但我不确定。

#!/usr/bin/perl
use strict;

my $inputfile1 = $ARGV[0];
open (FILE1, $inputfile1) or die "Uh oh.. unable to find file $inputfile1"; ##Opens input file

my @file1 = <FILE1>; #loads inputfile1 data into array
close FILE1;


my (@colsplit, @genotypes1, @genotypes2, @genotypes3, @joined); 
foreach my $line(@file1) { ## process each line, splitting columns and move onto next line
    @colsplit = split("\t", $line);
        push (@joined, $colsplit[0]);
            foreach my $lines(@colsplit) {
                if ($colsplit[1] =~ m/(^0\/1)/ || $colsplit[1] =~ m/(^0\/0)/ || $colsplit[1]=~ m/(^1\/0)/ || $colsplit[1] =~ m/(^1\/1)/ || $colsplit[1] =~ m/(^.\/.)/) {
                    push (@genotypes1, );
                    }
                    if ($colsplit[2] =~ m/(^0\/1)/ || $colsplit[2] =~ m/(^0\/0)/ || $colsplit[2] =~ m/(^1\/0)/ || $colsplit[2] =~ m/(^1\/1)/ || $colsplit[2] =~ m/(^.\/.)/) {
                    push (@genotypes2, );
                    }
                    if ($colsplit[3] =~ m/(^0\/1)/ || $colsplit[3] =~ m/(^0\/0)/ || $colsplit[3] =~ m/(^1\/0)/ || $colsplit[3] =~ m/(^1\/1)/ || $colsplit[3] =~ m/(^.\/.)/) {
                    push (@genotypes3, ); 
                    }       
                }
            }



my $i = 0;
foreach my $line(@joined) {
    if ($line =~ m/GT/) {
print "$line\t$genotypes1[$i]\t$genotypes2[$i]\t$genotypes3[$i]\n";
$i++;
    }}

我认为问题可能在于，一旦第一个 sample1 列匹配，它就会跳转到第二行，而不是遍历第二个 sample2 列。否则我看不出我是怎么搞砸的！快把我逼疯了！

我当前的输出是：

GT:AD:DP:GQ:PL  0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL  0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL  0/1 0/0 0/0
GT:AD:DP:GQ:PL  0/1 0/0 0/0
GT:AD:DP:GQ:PGT:PID:PL  0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL  0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL  0/0 0/0 0/1
GT:AD:DP:GQ:PGT:PID:PL  0/0 0/0 0/1
GT:AD:DP:GQ:PL  ./. 0/0 0/1

这显然不是我想要的！

如有任何帮助，我们将不胜感激。

Ps。我是新手，所以放轻松。

Answer 1

问题似乎很清楚，要在每一列中保留 : 之前的数据（第一列除外）。但是后来我对尝试的代码感到有些困惑，这些代码不必要地经历了列的特定模式。

这里简单介绍一下所描述的内容

use warnings;
use strict;
use feature 'say';

my $header_line = <>;   # drop the first line

while (<>) {            # line by line from files given on cmdline, or STDIN
    chomp;

    my ($fmt_col, @cols) = split /\t/;   # (but sample has spaces, not tabs)

    s/(.*?):.*// for @cols;            # keep up to (first) : in each field
    
    say join "\t", $fmt_col, @cols;      # print fields joined by tabs
}

<> 从命令行中给定的所有文件中读取所有行，或 STDIN；所以当运行这个程序时，在命令行上提交一个文件名来处理。

请注意，发布的示例没有制表符，而是空格；所以上面的代码如果复制粘贴就会失败。要么将空格更改为制表符以进行测试，要么将 split /\t/; 更改为 split;（以便使用其默认拆分，即任意数量的任意空格）。

除第一个字段之外的所有字段都已更改，以便仅保留第一个 : 之前的字符。

这是使用在 foreach 循环（“topicalizer”）中处理的每个项目都别名为当前处理的元素这一事实来完成的。因此，当正则表达式 s/// 更改它时， @cols 的相应元素也会更改。如果这看起来太难吃，请一定要慢慢地把它写出来。

如果确实需要做其他事情，请说明。

Answer 2

我相信我已经解决了这个问题。我删除了 foreach my $lines(@colsplit) {，现在可以使用了！代码如下：

#!/usr/bin/perl
use strict;

my $inputfile1 = $ARGV[0];
open (FILE1, $inputfile1) or die "Uh oh.. unable to find file $inputfile1"; ##Opens input file

my @file1 = <FILE1>; #loads inputfile1 data into array
close FILE1;


my (@colsplit, @genotypes1, @genotypes2, @genotypes3, @joined); 
foreach my $line(@file1) { ## process each line, splitting columns and move onto next line
    @colsplit = split("\t", $line);
        push (@joined, $colsplit[0]);
                if ($colsplit[1] =~ m/(^0\/1)/ || $colsplit[1] =~ m/(^0\/0)/ || $colsplit[1]=~ m/(^1\/0)/ || $colsplit[1] =~ m/(^1\/1)/ || $colsplit[1] =~ m/(^.\/.)/) {
                    push (@genotypes1, );
                    }
                    if ($colsplit[2] =~ m/(^0\/1)/ || $colsplit[2] =~ m/(^0\/0)/ || $colsplit[2] =~ m/(^1\/0)/ || $colsplit[2] =~ m/(^1\/1)/ || $colsplit[2] =~ m/(^.\/.)/) {
                    push (@genotypes2, );
                    }
                    if ($colsplit[3] =~ m/(^0\/1)/ || $colsplit[3] =~ m/(^0\/0)/ || $colsplit[3] =~ m/(^1\/0)/ || $colsplit[3] =~ m/(^1\/1)/ || $colsplit[3] =~ m/(^.\/.)/) {
                    push (@genotypes3, ); 
                }
            }



my $i = 0;
foreach my $line(@joined) {
    if ($line =~ m/GT/) {
print "$line\t$genotypes1[$i]\t$genotypes2[$i]\t$genotypes3[$i]\n";
$i++;
    }}

特别感谢@zdim 向我展示了一个更优雅的解决方案。 E

Answer 3

一种可能的方法是使用正则表达式匹配的continuation modifier (c)。它会导致匹配操作从最后一个找到匹配项的地方开始。这样，一行的每个字段都可以处理。

#!/usr/bin/perl

use strict;
use warnings;

my $fh;
$ARGV[0] && open($fh, $ARGV[0]) || die();

# discard first line;
<$fh>;

foreach my $line (<$fh>) {
  chomp($line);

  # capture every non tab character from the beginning of the line;
  # globally match the pattern repeatedly in the string (g modifier)
  # keep the current position during repeated matching (c modifier);
  $line =~ m/^([^\t]*)/gc;
  print("");

  # capture every non colon character after a tab character;
  # globally match the pattern repeatedly in the string (g modifier);
  # keep the current position during repeated matching (c modifier);
  while ($line =~ m/\t([^:]*)/gc) {
    print("\t");
  }

  print("\n");
}

循环遍历 perl 中的列的问题

Issue with looping through columns in perl

perl

bioinformatics