Remove duplicates from each cell
I have a file like this and need to remove the duplicates within each cell without changing the order or the formatting:
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car,car Case CAT1,CAT1,Dog p.12>a,p.12>a
23 as swe 34 2,2 Bus,Bus Case1,, Dog1,Dog1,, N.12>a,N.12>a
23 ks awe 35 . Bike,Bike Case1,, rat4,rat4,, 5.16>b,5.16>b
Missing data is marked with . (a dot).
So far I have tried awk:
awk '{str="";c=0;split($0,arr,","); for (v in arr) c++; for (m=c;m >= 1;m--) for (n=1; n<m;n++) if (arr[m] == arr[n]) delete arr[m]; for (k=1;k<=c;k++) {if (k ==1 ) {s=arr[k] } else if (arr[k] != "") str=str" "arr[k] } print str}'
But it destroys the formatting. Is there another way?
Expected output:
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car Case CAT1,Dog p.12>a
23 as swe 34 2 Bus Case1 Dog1 N.12>a
23 ks awe 35 . Bike Case1 rat4 5.16>b
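Since the input looks fixed-width, you can use unpack to split it into columns. Then split each cell on commas and use uniq to remove the duplicates while preserving order. Finally, use pack to put the columns back together at their original widths.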
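use warnings;
use strict;
use List::Util qw(uniq);
# Column widths; 'A' strips trailing whitespace on unpack and pads on pack.
my $tmpl = 'A6A6A7A5A6A10A8A15A*';
while (<DATA>) {
    my @cols = unpack $tmpl, $_;
    for my $c (@cols) {
        $c =~ s/^\s+//;               # drop leading whitespace
        my @items = split /,/, $c;    # trailing empty items are discarded
        $c = join ',', uniq(@items);  # uniq keeps the first occurrence of each item
    }
    print pack($tmpl, @cols), "\n";
}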
__DATA__
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car,car Case CAT1,CAT1,Dog p.12>a,p.12>a
23 as swe 34 2,2 Bus,Bus Case1,, Dog1,Dog1,, N.12>a,N.12>a
23 ks awe 35 . Bike,Bike Case1,, rat4,rat4,, 5.16>b,5.16>b
Output:
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car Case CAT1,Dog p.12>a
23 as swe 34 2 Bus Case1 Dog1 N.12>a
23 ks awe 35 . Bike Case1 rat4 5.16>b
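For readers who do not use Perl, the same slice, deduplicate and re-pad idea can be sketched in Python. This is only an illustration, not part of the answer above: the function names and the `widths` list are assumptions, and unlike the Perl `split` it drops every empty item, not just trailing ones.

```python
def dedupe_cell(cell):
    """Drop empty and repeated comma-separated items, keeping first-seen order."""
    items = [item for item in cell.split(",") if item]
    # dict.fromkeys preserves insertion order, so the first occurrence wins.
    return ",".join(dict.fromkeys(items))


def dedupe_fixed_width(line, widths):
    """Slice a line into fixed-width columns, dedupe each cell, and re-pad.

    `widths` (hypothetical) lists the width of every column except the last,
    which takes whatever remains of the line.
    """
    cells = []
    start = 0
    for width in widths:
        cells.append(line[start:start + width].strip())
        start += width
    cells.append(line[start:].strip())

    # Pad every column except the last back to its original width.
    padded = [dedupe_cell(c).ljust(w) for c, w in zip(cells, widths)]
    padded.append(dedupe_cell(cells[-1]))
    return "".join(padded).rstrip()


if __name__ == "__main__":
    row = "23".ljust(4) + "Bus,Bus".ljust(10) + "Case1,,".ljust(8) + "N.12>a,N.12>a"
    print(dedupe_fixed_width(row, [4, 10, 8]))
```

Because deduplication can only shorten a cell, padding with `ljust` is enough to keep every column at its original start position.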
With sed (this assumes the input is tab-separated):
$ sed -E 's/\t(.*),/\t/g;s/,+\t/\t/g' file | column -ts$'\t'
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car Case CAT1,Dog p.12>a
23 as swe 34 2 Bus Case1 Dog1 N.12>a
23 ks awe 35 . Bike Case1 rat4 5.16>b
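Assuming your file is fixed-width rather than tab-delimited, you can deduplicate the fields with a regex substitution. Match each complete run of non-whitespace characters, split it on commas, deduplicate the result, and join it back together with commas. Then add one space for every character removed so the formatting is preserved.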
use strict;
use warnings;
my $hdr = <DATA>;
print $hdr;
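while (<DATA>) {
    # For each whitespace-delimited cell ($1): dedupe its comma-separated
    # items, then pad with spaces back to the cell's original length.
    s/(\S+)/ my %s; my $n = join ',', grep { !$s{$_}++ } split ',', $1; $n .= ' ' x (length($1) - length($n)); $n; /eg;
    print;
}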
__DATA__
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car,car Case CAT1,CAT1,Dog p.12>a,p.12>a
23 as swe 34 2,2 Bus,Bus Case1,, Dog1,Dog1,, N.12>a,N.12>a
23 ks awe 35 . Bike,Bike Case1,, rat4,rat4,, 5.16>b,5.16>b
Output:
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car Case CAT1,Dog p.12>a
23 as swe 34 2 Bus Case1 Dog1 N.12>a
23 ks awe 35 . Bike Case1 rat4 5.16>b
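The same substitute-and-pad technique can be sketched in Python with `re.sub` and a replacement function. This is an illustrative equivalent rather than the answer's own code; the function name `dedupe_keep_width` is an assumption, and empty items are dropped wherever they occur.

```python
import re


def dedupe_keep_width(line):
    """Dedupe every whitespace-delimited cell without moving later columns."""

    def repl(match):
        token = match.group(0)
        items = [item for item in token.split(",") if item]
        deduped = ",".join(dict.fromkeys(items))  # keeps first-seen order
        # Pad with spaces so the cell still occupies its original width.
        return deduped.ljust(len(token))

    return re.sub(r"\S+", repl, line)


if __name__ == "__main__":
    row = "23 as  2,2   Bus,Bus  Case1,,  N.12>a,N.12>a"
    print(dedupe_keep_width(row).rstrip())
```

Since each cell is replaced by a string of exactly the same length, no column widths need to be known in advance. The padding after the last cell leaves trailing spaces, which the usage example strips.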
Using any POSIX awk:
$ cat tst.awk
NR==1 {
    # Derive each column width from the header: name plus trailing whitespace.
    hdr = $0
    while ( match(hdr,/[^[:space:]]+[[:space:]]+/) ) {
        width[++i] = RLENGTH
        hdr = substr(hdr,RSTART+RLENGTH)
    }
}
{
    for ( i=1; i<=NF; i++ ) {
        fld = ""
        delete seen
        n = split($i,parts,/,/)
        for ( j=1; j<=n; j++ ) {
            part = parts[j]
            # Keep non-empty items the first time they are seen.
            if ( (part != "") && !seen[part]++ ) {
                fld = (fld == "" ? "" : fld ",") part
            }
        }
        printf "%-*s", width[i], fld
    }
    print ""
}
$ awk -f tst.awk file
Sl.no Name1 Name2 Dis From Type item Animal Code
2 qw wsa 12 23 car Case CAT1,Dog p.12>a
23 as swe 34 2 Bus Case1 Dog1 N.12>a
23 ks awe 35 . Bike Case1 rat4 5.16>b
The above assumes you don't really want "From" in the header line to start 1 character before the data values below it, nor "Code" to be right-aligned when everything else is left-aligned.
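The header-driven layout used by the awk script can also be sketched in Python. This is a rough equivalent under stated assumptions, not a translation endorsed by the answer: the function names are hypothetical, every row is assumed to have the same number of whitespace-separated fields as the header, and no cell contains spaces.

```python
import re


def column_widths(header):
    """Width of each column: the header name plus the whitespace after it."""
    return [len(m.group(0)) for m in re.finditer(r"\S+\s*", header.rstrip("\n"))]


def reformat(lines):
    """Dedupe every cell and left-align it to the width taken from the header."""
    widths = column_widths(lines[0])
    out = []
    for line in lines:
        cells = []
        for cell in line.split():
            items = [item for item in cell.split(",") if item]
            cells.append(",".join(dict.fromkeys(items)))  # first-seen order
        # Assumes len(cells) == len(widths); extra fields would be dropped by zip.
        out.append("".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip())
    return out


if __name__ == "__main__":
    sample = [
        "No   Type      Code",
        "2    car,car   p.12>a,p.12>a",
        "23   Bus,Bus   N.12>a,N.12>a",
    ]
    print("\n".join(reformat(sample)))
```

As with the awk version, every column comes out left-aligned to the start of its header name, so a header that is offset from its data will be realigned rather than preserved.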