Split a file by column value with Perl Text::CSV

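I previously asked (in this question) how to do this with AWK, but that approach did not handle my data well: the data has semicolons inside quoted fields, and AWK does not take that into account. So I am now trying Perl with the Text::CSV module, so that I do not have to deal with the quoting myself. The problem is that I cannot figure out how to write each row to a file chosen by a column value.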

A short example from the previous question. Data:

10002394;"""22.98""";48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;"""Miami""";http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;Chicago;"""http://testdata.com/bla/28798580.jpg""";5.95;10201848233    
10025825;12.99;65;Chicago;"""http://testdata.com/bla/29017837.jpg""";5.95;93962025367

Desired result:

File --> 26.csv
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049

File --> 48.csv
10002394;22.98;48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452

File --> 53.csv
10003122;13.0;53;Miami;http://testdata.com/bla/29019899.jpg;5.95;24404000059

File --> 65.csv
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10007645;20.99;65;Chicago;http://testdata.com/bla/28798580.jpg;5.95;10201848233    
10025825;12.99;65;Chicago;http://testdata.com/bla/29017837.jpg;5.95;93962025367

This is what I have so far. Edit: modified code:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
#use Data::Dumper;
use Time::Piece;

my $inputfile  = shift || die "Give input and output names!\n";

open my $infile, '<', $inputfile or die "Sourcefile in use / not found :$!\n";

#binmode($infile, ":encoding(utf8)");

my $csv = Text::CSV_XS->new({binary => 1,sep_char => ";",quote_space => 0,eol => $/});

my %fh;
my %count;
my $country;
my $date = localtime->strftime('%y%m%d');

open(my $fh_report, '>', "report$date.csv") or die "Could not open report file: $!\n";

$csv->getline($infile);

while ( my $elements = $csv->getline($infile)){

    # Edited in: keep only rows whose 30th field mentions "testdata"
    next unless ($elements->[29] =~ m/testdata/);

    # Edited in: skip the whole row if any field matches one of these words
    next if grep { /apple|orange|strawberry/ } @$elements;

for (@$elements){
        s/\"+/\"/g;
        }

    my $filename = $elements->[2];
    # Report key: city plus split value, e.g. "New York;48"
    my $shop = $elements->[3] . ";" . $elements->[2];

    $count{$shop}++;

        $fh{$filename} ||= do {
            open(my $fh, '>:encoding(UTF-8)', $filename . ".csv") or die "Could not open file '$filename'";
            $fh;
        };

    $csv->print($fh{$filename}, $elements); 
    }

    #print $fh_report Dumper(\%count);
    foreach my $name (reverse sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
        print $fh_report "$name;$count{$name}\n";
    }

close $fh_report;

Error:

Can't call method "print" on an undefined value at sort_csv_delimiter.pl line 28, <$infile> line 2

I have been fiddling around with this, but I am at a complete loss. Can anyone help me?

My guess is that you need a hash that caches the filehandles, one per output file:

my %fh;
while ( my $elements = $csv->getline( $infile ) ) {

  my $filename = $elements->[2];

  $fh{$filename} ||= do {
    open my $fh, ">", "$filename.csv" or die $!;
    $fh;
  };

  # $csv->combine(@$elements);
  $csv->print($fh{$filename}, $elements);     
}
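The snippet above relies on `$csv` and `$infile` from the question. As a minimal self-contained sketch of the same filehandle-cache idea, the following reads a few hypothetical rows from an in-memory string and writes one file per value of the third field. The sample rows, the temporary output directory and the variable names are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Text::CSV;

# Hypothetical sample input: semicolon-separated, third field is the split key.
my $input = join "\n",
    '10002394;22.98;48;New York',
    '10025155;27.99;65;Chicago',
    '10029650;27.99;48;New York',
    '';

# Scratch directory for the demo (an assumption); removed when the script exits.
my $outdir = tempdir( CLEANUP => 1 );

my $csv = Text::CSV->new(
    { binary => 1, sep_char => ';', quote_space => 0, eol => "\n" } )
    or die 'Cannot create Text::CSV object: ' . Text::CSV->error_diag;

open my $in, '<', \$input or die "Cannot open in-memory input: $!";

my %fh;    # split value => output filehandle, opened once per value
while ( my $row = $csv->getline($in) ) {
    my $key = $row->[2];
    $fh{$key} ||= do {
        open my $out, '>', "$outdir/$key.csv"
            or die "Cannot open $key.csv: $!";
        $out;
    };
    $csv->print( $fh{$key}, $row );
}
close $in;

# Close every cached handle so buffered rows are flushed to disk.
for my $key ( sort keys %fh ) {
    close $fh{$key} or die "Cannot close $key.csv: $!";
}
```

Because each handle is opened only the first time its key is seen, rows can arrive in any order and no sorting is needed. With a very large number of distinct keys the process could run into the operating system's open-file limit, and a key taken from untrusted data should be validated before it is used as a filename.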

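I don't see any instance of the problem you describe (the semicolon separator ; appearing inside a quoted field) in your sample data, but you are right that Text::CSV will handle it correctly.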

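This short program reads your sample data from the DATA filehandle and prints the result to STDOUT. I assume you know how to read from or write to different files if you wish.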

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });

my @data;

while ( my $row = $csv->getline(\*DATA) ) {
  push @data, $row;
}

my $file;

for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
  unless (defined $file and $file == $row->[2]) {
    $file = $row->[2];
    printf "\nFile --> %d.csv\n", $file;
  }
  $csv->print(\*STDOUT, $row);
}

__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233    
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367

Output

File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049

File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452

File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059

File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;"10201848233    "
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367

Update

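I have just realised that your "desired result" isn't output you want to see printed, but a description of how the separate records should be written to different files. This program does that.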

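From your question it looks as though you also want the data sorted by the first field, so I have read the whole file into memory and printed a sorted version to the relevant files. I have also used autodie to avoid having to write status checks for all the IO operations.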

use strict;
use warnings;
use autodie;

use Text::CSV;

my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });

my @data;

while ( my $row = $csv->getline(\*DATA) ) {
  push @data, $row;
}

my ($file, $fh);

for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
  unless (defined $file and $file == $row->[2]) {
    $file = $row->[2];
    open $fh, '>', "$file.csv";
  }
  $csv->print($fh, $row);
}

close $fh;

__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233    
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
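On the point about semicolons inside quoted fields, here is a minimal sketch of how Text::CSV treats them, assuming a hypothetical input line (the "Smith; Jones" value is invented for illustration). It also shows that a field written as """22.98""", as in the question's sample data, decodes to a value with literal quote characters, which can then be stripped.

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, sep_char => ';' } )
    or die 'Cannot create Text::CSV object: ' . Text::CSV->error_diag;

# Hypothetical line: the second field contains the separator inside quotes,
# the third is wrapped in doubled quotes like the question's sample data.
my $line = '10002394;"Smith; Jones";"""22.98""";48';

$csv->parse($line) or die 'Parse failed: ' . $csv->error_diag;
my @fields = $csv->fields;

# Doubled quotes inside a quoted field decode to literal quote characters,
# so delete them if the output should contain the bare value.
tr/"//d for @fields;

print scalar(@fields), " fields\n";
print "$_\n" for @fields;
```

The quoted semicolon stays inside a single field, which is the behaviour a plain field split on ; cannot provide.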

FWIW, I have done this using Awk (gawk):

awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }' bigfile.txt

other_process data | awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }'

Let me explain the awk script:

BEGIN {                          # execution block before reading any file (once)
  if(!(col ~/^[1-9]/)) exit 2;   # assert the `col` variable is a positive number
  outname = "part-%s.txt";       # formatting string of the output file names
}
!/^#/ {                          # only process lines not starting with '#' (header/comments in various data files)
  out = sprintf(outname, $col);  # format the output file name, given the value in column `col`
  print > out;                   # put the line to that file
}

If you like, you can add a variable to specify a custom filename, or use the current filename (or STDIN) as a prefix:

NR == 1 {                                                         # at the first record (not BEGIN, as we might need FILENAME)
  if(!(col ~/^[1-9]/)) exit 2;                                    # assert the `col` variable is a positive number
  if(!outname) outname = (FILENAME == "-" ? "STDIN" : FILENAME);  # if `outname` variable was not provided (with `-v/--assign`), use current filename or STDIN
  if(!(outname ~ /%s/)) outname = outname ".%s";                  # if `outname` is not a formatting string - containing %s - append it
}
!/^#/ {                                                           # only process lines not starting with '#' (header/comments in various data files)
  out = sprintf(outname, $col);                                   # format the output file name, given the value in column `col`
  print > out;                                                    # put the line to that file
}

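Note: if you provide multiple input files, only the name of the first file is used as the output prefix. To support multiple input files with multiple prefixes, you could use FNR == 1 instead and add another variable to distinguish a user-provided outname from an automatically generated one.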