比较 2 个 CSV 巨大的 CSV 文件并使用 perl 将差异打印到另一个 csv 文件

Question

我有 2 个包含多个字段（大约 30 个字段）的 csv 文件，而且大小很大（大约 4GB）。

File1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File2:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"

我想比较这 2 个文件并将差异报告到另一个 csv 文件中。在上面的示例中，Employee Vivek 和 Karthick 详细信息出现在不同的行号中，但记录数据仍然相同，因此应视为匹配。员工 Vinoth 记录应被视为不匹配，因为地址不匹配。

输出 diff.csv 文件可以包含文件 1 和文件 2 中的不匹配记录，如下所示。

Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India" 
F2, Vinoth,12,2548.245,"140,North Street,USA"

到目前为止，我已经编写了代码。在此之后我很困惑选择哪个选项是二进制搜索还是任何其他有效的方法来执行此操作。你能帮帮我吗？

My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.

use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters

my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );

my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}
while (1) {

    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }

    if ( $f1empty eq 'true' ) {
        last;
    }
    else {
        ## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.

        }
    }

print "End of Program" . "\n";

Answer 1

在数据库中存储如此 数量级 的数据是处理此类任务的最正确方法。至少 SQLlite is recommended but other databases MariaDB, MySQL, PostgreSQL 会很好用。

以下代码演示了如何在没有特殊模块的情况下实现所需的输出，但它没有考虑可能 混乱的输入数据 。此脚本会将数据记录报告为不同，即使差异可能只是一个额外的 space.

默认输出到控制台 window，除非您指定选项 output。

注意：整个文件 #1 已读入内存，请耐心等待处理大文件可能需要一段时间。

use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;

my %opt;
my @args = (
            'file1|f1=s',
            'file2|f2=s',
            'output|o=s',
            'debug|d',
            'help|?',
            'man|m'
        );

GetOptions( \%opt, @args ) or pod2usage(2);

print Dumper(\%opt) if $opt{debug};

pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};

pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};

unlink $opt{output} if defined $opt{output} and -f $opt{output};

compare($opt{file1},$opt{file2});

sub compare {
    my $fname1 = shift;
    my $fname2 = shift;

    my $hfile1 = file2hash($fname1);

    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        my($key,$data) = (, );
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }
}

sub say_msg {
    my $msg = shift;

    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't to open $opt{output}";

        say $fh $msg;

        close $fh;
    } else {
        say $msg;
    }
}

sub file2hash {
    my $fname = shift;
    my %hash;

    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        $hash{} = ;

    }

    close $fh;

    return \%hash;
}

__END__

=head1 NAME

comp_cvs - compares two CVS files and stores differense 

=head1 SYNOPSIS

 comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt

 Options:
    -f1,--file1 input CVS filename #1
    -f2,--file2 input CVS filename #2
    -o,--output output filename
    -d,--debug  output debug information
    -?,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CVS filename #1

=item B<-f2,--file2>

Input CVS filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Print debug information.

=item B<-?,--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal.

=head1 EXIT STATUS

The section describes B<EXIT STATUS> codes of the program

=head1 ENVIRONMENT

The section describes B<ENVIRONMENT VARIABLES> utilized in the program

=head1 FILES

The section describes B<FILES> which used for program's configuration

=head1 EXAMPLES

The section demonstrates some B<EXAMPLES> of the code

=head1 REPORTING BUGS

The section provides information how to report bugs

=head1 AUTHOR

The section describing author and his contanct information

=head1 ACKNOWLEDGMENT

The section to give credits people in some way related to the code

=head1 SEE ALSO

The section describing related information - reference to other programs, blogs, website, ...

=head1 HISTORY

The section gives historical information related to the code of the program

=head1 COPYRIGHT

Copyright information related to the code

=cut

测试文件的输出

file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"

Answer 2

#!/usr/bin/env perl

use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;

my %POS;
my %chars;

open my $FILEA,'<',q{FileA.txt}
    or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
    or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
    or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
    or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
    or die "Can't open 'OnlyInB.txt' for writing! $!";
<$FILEA>,
    $POS{FILEA}=tell $FILEA;
<$FILEB>,
    $POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';

{ # Scan for first character of the records involved
    while (<$FILEA>) {
        $chars{substr($_,0,1)}++;
        };
    while (<$FILEB>) {
        $chars{substr($_,0,1)}--;
        };
    # So what characters do we need to deal with?
    warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
    };
my @chars=sort keys %chars;
{
    my %_h;
    # For each of the characters in our character set
    for my $char (@chars) {
        warn Data::Dumper->Dump([$char],[qw(*char)]),' ';
        # Beginning of data sections
        seek $FILEA,$POS{FILEA},0;
        seek $FILEB,$POS{FILEB},0;
        %_h=();
        my $pos=tell $FILEA;
        while (<$FILEA>) {
            next
                unless (substr($_,0,1) eq $char);
            # for each record save the lengthAndMD5 as the key and its start as the value
            $_h{lengthAndMD5($_)}=$pos;
            $pos=tell $FILEA;
            };
        my $_s;
        while (<$FILEB>) {
            next
                unless (substr($_,0,1) eq $char);
            if (exists $_h{$_s=lengthAndMD5($_)}) { # It's a duplicate
                print {$InBoth} $_;
                delete $_h{$_s};
                }
            else { # (Not in FILEA) It's only in FILEB
                print {$OnlyInB} $_;
                }
            };
        # only in FILEA
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        for my $key (keys %_h) { # Only in FILEA
            seek $FILEA,delete $_h{$key},0;
            print {$OnlyInA} scalar <$FILEA>;
            };
        # Should be empty
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        };
    };

close $OnlyInB
    or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
    or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
    or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
    or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
    or die "Could NOT close 'FileA.txt' after reading! $!";
exit;

    sub lengthAndMD5 {
        return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
        };

__END__

比较 2 个 CSV 巨大的 CSV 文件并使用 perl 将差异打印到另一个 csv 文件

Compare 2 CSV Huge CSV Files and print the differences to another csv file using perl

csv

perl

cpan

metacpan