Collapse rows based on column 1
I want to parse InterProScan results for use with the TopGO R package.
I need a file in a slightly different format from the one I have.
# input file (gene_ID GO_ID1, GO_ID2, GO_ID3, ....)
Q97R95 GO:0004349, GO:0005737, GO:0006561
Q97R95 GO:0004349, GO:0006561
Q97R95 GO:0005737, GO:0006561
Q97R95 GO:0006561
# desired output (removed duplicates and rows collapsed)
Q97R95 GO:0004349,GO:0005737,GO:0006561
You can test your tool against the whole data file here:
https://drive.google.com/file/d/0B8-ZAuZe8jldMHRsbGgtZmVlZVU/view?usp=sharing
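You can use GNU awk's two-dimensional arrays (arrays of arrays, which need gawk 4.0 or later):
awk -F'[, ]+' '
# r[key][value] acts as a set of values per key; referencing an element creates it.
{ for (i = 2; i <= NF; i++) r[$1][$i] }
END {
    for (x in r) {
        printf "%s ", x
        b = 0
        for (y in r[x]) { printf "%s%s", (b ? "," : ""), y; b = 1 }
        print ""
    }
}' file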
It gives:
Q97R95 GO:0005737,GO:0006561,GO:0004349
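The duplicate fields are removed, but the order is not preserved, because awk's `for (y in array)` loop visits elements in an unspecified order.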
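If the order matters, one option is to record keys and values in the order they are first seen rather than relying on `for (... in ...)`. The following is only a sketch, not part of the original answer: the function name `collapse_rows` is made up for illustration, and it assumes the key is the first field and that fields are separated by spaces, tabs, or commas. It uses only POSIX awk features, so it should not need gawk, but it keeps every key and its value list in memory until the end.

```shell
# collapse_rows: hypothetical helper name, not from the original thread.
# Reads the named files (or stdin) and prints one line per key, with
# de-duplicated values in first-seen order.
collapse_rows() {
    awk -F'[, \t]+' '
    # Skip blank lines and lines that do not start with a key.
    $1 == "" { next }
    {
        # Remember each key the first time it appears, to keep key order.
        if (!($1 in out)) { keys[++n] = $1; out[$1] = "" }
        for (i = 2; i <= NF; i++) {
            # Ignore empty trailing fields and values already seen for this key.
            if ($i == "" || seen[$1, $i]++) continue
            out[$1] = out[$1] (out[$1] == "" ? "" : ",") $i
        }
    }
    END {
        for (k = 1; k <= n; k++) print keys[k], out[keys[k]]
    }' "$@"
}
```

It could be called as `collapse_rows file > collapsed.txt`. Rows with the same key do not need to be adjacent in the input.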
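Here is a hopefully tidy Perl solution. It preserves the order of keys and values as far as possible, and it does not keep the whole file in memory, only what it needs to do the job. Note that it only merges consecutive rows with the same key, so the input should be grouped (for example, sorted) by column 1.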
#!perl
use strict;
use warnings;

my ($prev_key, @seen_values, %seen_values);

while (<>) {
    # Parse the input, skipping blank lines
    chomp;
    next unless /\S/;
    my ($key, $values) = split /\s+/, $_, 2;
    my @values = split /,\s*/, $values // '';

    # If we have a new key (or this is the first line)...
    if (!defined $prev_key or $key ne $prev_key) {
        # output the old data, as long as there is some,
        if (@seen_values) {
            print "$prev_key\t", join(", ", @seen_values), "\n";
        }
        # clear it out,
        @seen_values = %seen_values = ();
        # and remember the new key for next time.
        $prev_key = $key;
    }

    # Merge this line's values with previous ones, de-duplicating
    # but preserving order.
    for my $value (@values) {
        push @seen_values, $value unless $seen_values{$value}++;
    }
}

# Output what's left after the last line
if (@seen_values) {
    print "$prev_key\t", join(", ", @seen_values), "\n";
}