如何使用 Perl 在大文件的顶部以特定模式最佳地移动行?

How to optimally move lines with a specific pattern at the top in a huge file using Perl?

我有一个近 20k 行的巨大 csv 文件,格式如下:

file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

我需要在顶部放置 2 行具有相同语法(即第 4 列)的模式。然后其余的行将按原样存在。这意味着前两行的语法为 'perl',然后是 'java'、'python' 等。

到目前为止,我已经使用 seek and tell 编写了下面的代码以使其优化。但是,它没有按预期工作。

use strict;
use warnings;

open(FP, "+<mycsv.csv");

my %hash = ();
my $cur_pos;    


while(<FP>) {

    my $line = $_;
    chomp $line;
    #print "$line aaa\n";
    if($line =~ /^file\,tools,/) {next;}

    if($line =~ /^\w+\,\w+\,\w+,(\w+)\,.*$/) {
        my $type = ;
        #print "type $type\n";

    if($hash{$type}->{count} < 2 ) {
        #print "--- here type = $type | lastpos = ", $hash{$type}->{lastpos} , "\n";
        $cur_pos = tell(FP);
        my $pos = tell(FP) - length($line); 
        if($hash{$type}->{lastpos} ) {

            my $lastpos = $hash{$type}->{lastpos};
            seek(FP, $lastpos, 1);
            print FP $line;
            seek(FP, $cur_pos, 1);
        } 

        $hash{$type}->{lastpos} = $pos;


    }
        if(exists $hash{$type} ) {
            $hash{$type}->{count} += 1;
        } else {
            $hash{$type}->{count} = 1;
        }


    }
}


close(FP);

预期输出应如下所示:

 file,tools,edit,syntax,buffers
    a,b,c,perl,d
    a,e,c,perl,d
    a,w,c33,java,d
    wa,b,c33,java,d
    a,s,c,python,d1
    a,f,c,python,dd
    a,n,c,php,d3
    wa,b,c33,php,d
    d,r,hhh,cpp,d0
    d,buuu,hhh,cpp,d0
    d,m,hhh,c#,d0
    wa,b,c33,c#,d
    a,o,c,pdf,d3 
    a,yb,c,c,ddf 
    d44,b,hhh,nlp,d0
    a,be,c,js,d4  
    a,h,c,perl,dg   
    a,b,c,perl,dt   
    wa,b,c33,java,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,php,d
    wa,b,c33,python,d
    wa,b,c33,perl,d
    wa,b,c33,php,d
    wa,b,c33,java,d
    wa,b,c33,python,d

如能提供帮助,我们将不胜感激。

谢谢。

对于相同的逻辑,我得到的输出与您的输出略有不同。您能否查看此输出并让我知道是否需要进行任何更改?在评论中提到了方法。

use strict;
use warnings;
use feature 'say';
my $syntax = [];
my $NUM = 2;   # change number if needed
my $filename = 'file.txt';
my $data = {};  # make a hash of data

open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
    chomp $row;
    next if $. == 1; # skip header row
    my @columns = split (',', $row);

    push @$syntax, $columns[3];   # make a list of all syntaxes available
    push @{$data->{$columns[3]}}, $row;
}
close $fh;

my $processed = {};
# loop throught the syntax array and print data from hash
# also, make a counter of the number of times that syntax is used.
# it will help us to skip next (n-1) occurence of that syntax
for my $syntax (@$syntax) {
    if (!$processed->{$syntax}){
        for my $s (splice @{$data->{$syntax}}, 0, $NUM) {
            $processed->{$syntax} += 1;
            say $s;
        }
    } else {
        $processed->{$syntax} -= 1;
    }
}
# print out the remaining values
for my $rem (values %$data){
    say for @$rem;    
}

输出:

a,b,c,perl,d
a,e,c,perl,d
a,w,c33,java,d
wa,b,c33,java,d
a,s,c,python,d1
a,f,c,python,dd
a,n,c,php,d3
wa,b,c33,php,d
d,r,hhh,cpp,d0
d,buuu,hhh,cpp,d0
d,m,hhh,c#,d0
wa,b,c33,c#,d
a,o,c,pdf,d3
a,yb,c,c,ddf
a,h,c,perl,dg
a,b,c,perl,dt
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,perl,d

我会通过解析文件来收集数据结构中的第一对行并将其他行发送到临时文件来解决这个问题。完成文件解析后,将数据结构中的行对打印到输出文件中,然后将临时文件添加到输出文件的末尾。

示例代码:

use strict;
use warnings;
use feature ':5.16';

my $infile = 'infile';
my $outfile = 'outfile';
my $tempfile = 'temp';
my $quantity = 2;  # or whatever

open my $in, '<', $infile or die 'Could not open infile: ' . $!;
open my $out, '>', $outfile or die 'Could not create output file: ' . $!;
open my $temp, '>', $tempfile or die 'Could not create tempfile: ' . $!;

my $hash = {};
my @order;
my $hdr;

while ( <$in> ) {
  if ( $hdr ) {
    my @cols = split ",", $_;
    my $key = $cols[3];

    # have we seen this key before?
    if ( ! $hash->{$key} ) {
      push @order, $key;
      $hash->{$key} = [ $_ ];
    }
    elsif ( scalar @{$hash->{$key}} < $quantity ) {
      push @{$hash->{$key}}, $_;
    }
    else {
      print { $temp } $_;
    }
  }
  else {
    # the header line
    print { $out } $_;
    $hdr = $_;
  }
}

# print the collected twofers out into the tempfile
for my $key ( @order ) {
  print { $out } @{$hash->{$key}};
}
close $out;
close $temp;

# concatenate the files
system join ' ', ( 'cat', $tempfile, '>>', $outfile );

如果成对的行不必按照它们在源文件中出现的顺序排列,您可以跳过 @order 内容。

I have a huge CSV file of nearly 20k rows with below format:

无论如何都不算​​大。文件大小大概是一兆左右。

虽然我通常建议逐行处理以确保文件大小的稳健性,但在这种情况下,您知道您正在处理的文件很小。问题是你花在优化这个东西上的时间值不值得。

如果我理解正确,你的问题可以通过浪费一些内存快速解决(在程序员的时间内):

#!/usr/bin/env perl

use strict;
use warnings;
use List::Util qw( uniqstr );

my $TOP = 2;

(my $header = <DATA>) =~ s/\s+\z//;
my @header = split /,|\s+/, $header;
my %idx = map +($header[$_] => $_), 0 .. $#header;

my @lines = grep /\S/, <DATA>;
my %syntax_of = map +($_ => (split /,/, $_)[$idx{syntax}]), @lines;

my @syntaxes = uniqstr map $syntax_of{$_}, @lines;

my %lines_of;
for my $n (0 .. $#lines) {
    push @{$lines_of{$syntax_of{$lines[$n]}}}, $n;
}

print "$header\n";

for my $syntax (@syntaxes) {
    my @top = grep defined, map $lines_of{$syntax}->[$_ - 1], 1 .. $TOP;
    print @lines[@top];
    # normally, invoking delete on an array slice is not
    # but it is just what we need here.
    delete @lines[@top];
}

print grep defined, @lines;

__DATA__
file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

PS:另见 Tie::File

PPS:乍一看,如果有人想花时间在这上面,至少有六件事可能会倾向于调整。