Collapse rows based on column 1

I want to parse InterProScan results for use with the topGO R package.

I need a file in a format slightly different from the one I have.

# input file (gene_ID  GO_ID1, GO_ID2, GO_ID3, ....)
Q97R95  GO:0004349, GO:0005737, GO:0006561
Q97R95  GO:0004349, GO:0006561
Q97R95  GO:0005737, GO:0006561
Q97R95  GO:0006561


# desired output (removed duplicates and rows collapsed)
Q97R95  GO:0004349,GO:0005737,GO:0006561

You can test your tool against the full data file here:

https://drive.google.com/file/d/0B8-ZAuZe8jldMHRsbGgtZmVlZVU/view?usp=sharing

You can use GNU awk's two-dimensional arrays:

awk -F'[, ]+' '{for(i=2;i<=NF;i++)r[$1][$i]}
         END{for(x in r){
                printf "%s ",x;b=0;
                for(y in r[x]){printf "%s%s",(b?",":""),y;b=1}
                print ""}
         }' file

It gives:

Q97R95 GO:0005737,GO:0006561,GO:0004349

Duplicate fields are removed, but the original order is not preserved.
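
If the order matters, one possible variant is sketched below. It is not part of the answer above: it records each gene ID and each GO term the first time it appears and prints them in that order. It uses only one-dimensional arrays, so it should not depend on gawk. The function name collapse_rows and the tab between the two output columns are my own assumptions.

```shell
# collapse_rows: hypothetical helper; reads the rows from the given file(s) or from stdin.
collapse_rows() {
  awk -F'[, \t]+' '
    NF == 0 { next }                       # skip blank lines
    {
      # Remember gene IDs in first-seen order.
      if (!($1 in out)) { keys[++n] = $1; out[$1] = "" }
      for (i = 2; i <= NF; i++) {
        # Skip empty fields and GO terms already seen for this gene.
        if ($i == "" || seen[$1, $i]++) continue
        out[$1] = out[$1] (out[$1] == "" ? "" : ",") $i
      }
    }
    END { for (k = 1; k <= n; k++) print keys[k] "\t" out[keys[k]] }
  ' "$@"
}
```

Called as `collapse_rows file`, this keeps one output string per gene ID in memory, which is the price of not requiring the rows for a gene to be adjacent.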

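Here is a hopefully tidy Perl solution. It preserves the order of keys and values as far as possible, and it does not hold the whole file in memory, only what it needs to do the job.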

#!perl
use strict;
use warnings;

my ($prev_key, @seen_values, %seen_values);

while (<>) {
  # Parse the input, skipping blank lines
  chomp;
  next unless /\S/;
  my ($key, $values) = split /\s+/, $_, 2;
  my @values = split /,\s*/, $values // '';

  # If we have a new key (or this is the very first line)...
  if (!defined $prev_key || $key ne $prev_key) {
    # output the old data, as long as there is some,
    if (@seen_values) {
      print "$prev_key\t", join(", ", @seen_values), "\n";
    }
    # clear it out,
    @seen_values = %seen_values = ();
    # and remember the new key for next time.
    $prev_key = $key;
  }

  # Merge this line's values with previous ones, de-duplicating
  # but preserving order.
  for my $value (@values) {
    push @seen_values, $value unless $seen_values{$value}++;
  }
}

# Output what's left after the last line
if (@seen_values) {
  print "$prev_key\t", join(", ", @seen_values), "\n";
}
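
The script above flushes its output whenever the key changes, so it relies on all lines for a given key being adjacent in the input. If that is not guaranteed, one option is to group the lines first with a stable sort on the first column. The sketch below assumes a sort that supports the -s (stable) flag, as GNU and BSD sort do. The function name group_by_key is hypothetical.

```shell
# group_by_key: hypothetical helper; groups lines by their first column.
# -s keeps lines with equal keys in their original relative order,
# so the order of GO terms within each gene is unchanged.
group_by_key() {
  LC_ALL=C sort -s -k1,1 "$@"
}
```

Something like `group_by_key input.txt | perl collapse.pl` would then work, where collapse.pl is an assumed file name for the script above. Note that the keys come out in sorted order rather than in order of first appearance.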