How do I detect multiple duplicate fields in a file using Perl?

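I have a bunch of NETFLIX orders in my brokerage account. I accidentally entered two duplicate GTC sell orders, one on 1/5 and one on 1/6. How can I detect this with a Perl script?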

 Buy NFLX     50 @  315.00  Reg-Acct Fake
 Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15
Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14
...
Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14

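I would like the script to spit out just these two lines, which count as the same because fields[0] through fields[6] match (only the date differs).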

Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15

I would prefer a simple script (i.e. no one-liner, no hash), since I am new to Perl.

Thanks, Larry

I know you said no one-liner, but in case you only meant no Perl one-liner:

sort filename|rev|uniq -D -f 1|rev

I would prefer a simple script (no hash)

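Ugh. I missed the "no hash" part. Unfortunately, "simple" and "no hash" are opposing goals, and avoiding a hash also makes the script less efficient; see the code further down for how you should really do it. In the meantime, you need parallel arrays: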

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my @orders;
my @counts;

my $fname = 'data3.txt';

open my $ORDERSFILE, '<', $fname
    or die "Couldn't open $fname: $!";

LINE:
while (my $line = <$ORDERSFILE>) {
    my @pieces = split ' ', $line;
    my $date = pop @pieces;
    my $order = join ' ', @pieces;

    if (not @orders) { #then length of @orders is 0
        $orders[0] = $order;
        $counts[0] = 1;
        next LINE;
    }

    for my $i (0..$#orders) {
        if ($orders[$i] eq $order) {
            $counts[$i]++;
            next LINE;
        }
    }
    #If execution reaches here, then the order wasn't found in the array...
    my $i = $#counts + 1;
    $orders[$i] = $order;
    $counts[$i] = 1;
}

say Dumper(\@orders);
say Dumper(\@counts);


for my $i (0..$#counts) {
    if ($counts[$i] > 1) {
        say "($counts[$i]) $orders[$i]";
    }
}

--output:--
$VAR1 = [
          'Buy NFLX 50 @ 315.00 Reg-Acct',
          'Buy NFLX 50 @ 317.50 Reg-Acct OPEN',
          'Sell NFLX 50 @ 345.00 Reg-Acct OPEN',
          'Sell NFLX 50 @ 362.00 Reg-Acct OPEN',
          'Sell NFLX 50 @ 345.00 IRA-Acct OPEN'
        ];

$VAR1 = [
          1,
          1,
          2,
          1,
          1
        ];

(2) Sell NFLX 50 @ 345.00 Reg-Acct OPEN
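
Another hash-free route, in the spirit of the sort | uniq pipeline above, is to sort the lines by everything except the date and then compare each line with its neighbors. The following is only a sketch under that assumption; order_key and duplicate_lines are made-up names, and the sample data is copied from the question. Unlike the parallel-array version, it prints the original lines, which is what the question asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: the comparison key is the line with its last
# whitespace-separated field (the date) removed.
sub order_key {
    my ($line) = @_;
    my @pieces = split ' ', $line;
    pop @pieces;
    return join ' ', @pieces;
}

# Sort by key (then by full line), and keep every line whose key
# matches the line before or after it in the sorted list.
sub duplicate_lines {
    my @lines = sort { order_key($a) cmp order_key($b) or $a cmp $b } @_;
    my @dups;
    for my $i (0 .. $#lines) {
        my $key = order_key($lines[$i]);
        my $same_as_prev = $i > 0       && order_key($lines[$i - 1]) eq $key;
        my $same_as_next = $i < $#lines && order_key($lines[$i + 1]) eq $key;
        push @dups, $lines[$i] if $same_as_prev or $same_as_next;
    }
    return @dups;
}

my @sample = (
    ' Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15',
    'Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14',
    'Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14',
);

print "$_\n" for duplicate_lines(@sample);
```

With the sample above this should print only the two Reg-Acct sell orders at 345.00. Sorting costs more than a hash lookup per line, so treat it as a stopgap if hashes are really off the table.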

Here are some better solutions:

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my %dates_for;   #A key will be an order; a value will be a reference to an array of dates.

while (my $line = <DATA>) {
    my @pieces = split ' ', $line;
    my $date = pop @pieces;
    my $order = join ' ', @pieces;

    push @{$dates_for{$order}}, $date;  #autovivification (see explanation below)
}

say Dumper(\%dates_for);

my @dates;

for my $order (keys %dates_for) {
    @dates = @{$dates_for{$order}};
    my $dup_count = @dates;

    if ($dup_count > 1) {
        say "($dup_count) $order";
        say "   $_" for @dates;
    }
}


__DATA__
 Buy NFLX     50 @  315.00  Reg-Acct Fake
 Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15
Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14
Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14  


--output:--
$VAR1 = {
          'Sell NFLX 50 @ 345.00 IRA-Acct OPEN' => [
                                                     '09/15/14'
                                                   ],
          'Sell NFLX 50 @ 345.00 Reg-Acct OPEN' => [
                                                     '01/05/15',
                                                     '01/06/15'
                                                   ],
          'Buy NFLX 50 @ 317.50 Reg-Acct OPEN' => [
                                                    '01/13/15'
                                                  ],
          'Buy NFLX 50 @ 315.00 Reg-Acct' => [
                                               'Fake'
                                             ],
          'Sell NFLX 50 @ 362.00 Reg-Acct OPEN' => [
                                                     '11/25/14'
                                                   ]
        };

(2) Sell NFLX 50 @ 345.00 Reg-Acct OPEN
   01/05/15
   01/06/15
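
If you want the original lines back rather than the order plus a list of dates, the same hash-of-arrays idea works with whole lines as the stored values. This is a minimal sketch, not a drop-in replacement; duplicate_order_lines and @seen_order are names invented here, and the second array exists only so the output follows the input order instead of the hash's arbitrary key order.

```perl
use strict;
use warnings;

# Groups complete lines by order text (all fields except the trailing
# date) and returns the lines that belong to a group of two or more.
sub duplicate_order_lines {
    my @lines = @_;
    my %lines_for;    # key: order text; value: reference to an array of full lines
    my @seen_order;   # keys in the order they were first seen

    for my $line (@lines) {
        my @pieces = split ' ', $line;
        pop @pieces;                         # drop the date
        my $order = join ' ', @pieces;

        push @seen_order, $order unless exists $lines_for{$order};
        push @{ $lines_for{$order} }, $line;
    }

    my @dups;
    for my $order (@seen_order) {
        my @group = @{ $lines_for{$order} };
        push @dups, @group if @group > 1;
    }
    return @dups;
}

my @sample = (
    ' Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15',
    'Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14',
    'Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14',
);

print "$_\n" for duplicate_order_lines(@sample);
```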

When an undefined variable is dereferenced, it gets silently upgraded to an array or hash reference (depending of the type of the dereferencing). This behaviour is called autovivification and usually does what you mean (e.g. when you store a value)....

http://search.cpan.org/~vpit/autovivification-0.14/lib/autovivification.pm
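
To make the autovivification step concrete, here is a small sketch comparing the push used above with a spelled-out version that creates the array first. The key 'Sell NFLX' and the %explicit hash are placeholders for illustration only.

```perl
use strict;
use warnings;

my %dates_for;

# Nothing is stored under this key yet. Dereferencing the undefined
# value inside push creates an empty array and stores a reference to it.
push @{ $dates_for{'Sell NFLX'} }, '01/05/15';
push @{ $dates_for{'Sell NFLX'} }, '01/06/15';

# The same result without relying on autovivification:
my %explicit;
$explicit{'Sell NFLX'} = [] unless exists $explicit{'Sell NFLX'};
push @{ $explicit{'Sell NFLX'} }, '01/05/15', '01/06/15';

print ref($dates_for{'Sell NFLX'}), "\n";          # ARRAY
print scalar(@{ $dates_for{'Sell NFLX'} }), "\n";  # 2
```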

For fixed-width columns, using unpack() is more efficient:

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my $fname = 'data3.txt';

open my $ORDERSFILE, '<', $fname
    or die "Couldn't open $fname: $!";

my %dates_for;

while (my $line = <$ORDERSFILE>) {
    my ($order, $date) = unpack 'A41 @55 A*', $line;   #see explanation below
    push @{$dates_for{$order}}, $date;
}

close $ORDERSFILE;

say Dumper(\%dates_for);

my @dates;

for my $order (keys %dates_for) {
    @dates = @{$dates_for{$order}};

    if (@dates > 1) {
        my $dup_count = @dates;
        say "($dup_count) $order";
        say "   $_" for @dates;
    }
}

--output:--
$VAR1 = {
          ' Buy NFLX     50 @  317.50  Reg-Acct OPEN' => [
                                                           '01/13/15'
                                                         ],
          'Sell NFLX     50 @  362.00  Reg-Acct OPEN' => [
                                                           '11/25/14'
                                                         ],
          'Sell NFLX     50 @  345.00  Reg-Acct OPEN' => [
                                                           '01/05/15',
                                                           '01/06/15'
                                                         ],
          ' Buy NFLX     50 @  315.00  Reg-Acct Fake' => [
                                                           ''
                                                         ],
          'Sell NFLX     50 @  345.00  IRA-Acct OPEN' => [
                                                           '09/15/14'
                                                         ]
        };

(2) Sell NFLX     50 @  345.00  Reg-Acct OPEN
   01/05/15
   01/06/15

A41 @55 A* => extract 41 characters (A),
              jump to position 55 (@55),
              extract the remaining characters (A*)

You can jump forward and backward to any position you want, which means you can extract the pieces in whatever order you like.
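
As an illustration of jumping backward, the sketch below pulls the date out first and then returns to position 0 for the order. date_then_order is a made-up helper. The padding step is an assumption on my part: as far as I know, unpack dies with "'@' outside of string" when a line is shorter than the position you jump to, so short lines are padded to 55 characters first.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (date, order) for one fixed-width line.
sub date_then_order {
    my ($line) = @_;
    chomp $line;
    my $padded = sprintf '%-55s', $line;   # pad short lines with spaces; longer lines are unchanged
    # @55 jumps forward to the date column, @0 jumps back to the start.
    my ($date, $order) = unpack '@55 A* @0 A41', $padded;
    return ($date, $order);
}

# Build a sample line with the order in columns 0-40 and the date at column 55.
my $line = sprintf "%-55s%s\n", 'Sell NFLX     50 @  345.00  Reg-Acct OPEN', '01/05/15';
my ($date, $order) = date_then_order($line);
print "$date | $order\n";
```

A line with no date column, such as the "Fake" order, comes back with an empty date instead of stopping the script.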