Compare 2 huge CSV files and print the differences to another CSV file using Perl

I have 2 CSV files with multiple fields (about 30 fields each), and they are large (about 4 GB).


File 1:

Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"


File 2:

Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"

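I want to compare these 2 files and report the differences to another CSV file. In the example above, the details for employees Vivek and Karthick appear on different line numbers, but the record data is identical, so they should be treated as a match. The record for employee Vinoth should be treated as a mismatch because the address differs.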

The output file diff.csv can contain the mismatched records from File 1 and File 2, as shown below.

F1, Vinoth,12,2548.245,"140,North Street,India" 
F2, Vinoth,12,2548.245,"140,North Street,USA"


My approach:
1. Load File2 into memory as a hash of hashes.
2. Read File1 line by line and match each record against the hash of hashes in memory.

use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters

my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 = Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 = Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout = Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );

my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
# Read the first line of File1 (presumably the header row) before the main loop.
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}
while (1) {

    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }

    if ( $f1empty eq 'true' ) {
        last;    # File1 is exhausted
    }
    else {
        ## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
    }
}

print "End of Program" . "\n";

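Storing data of this magnitude in a database is the most appropriate way to handle this kind of task. At a minimum SQLite is recommended, but other databases such as MariaDB, MySQL or PostgreSQL will also work well.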
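As a rough illustration of the database route, the sketch below stores each record as a single text column in SQLite and asks for the records that occur in only one table. It assumes the non-core modules DBI and DBD::SQLite are installed; the table names f1 and f2, the column name rec, and the in-memory database are choices made for this example only, not part of the original question.

```perl
use strict;
use warnings;
use DBI;    # assumption: DBI and DBD::SQLite are installed (neither is core)

# In-memory database for illustration only; inputs of several GB would need
# a file-backed database instead.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# One table per input file, one whole record per row (hypothetical layout).
$dbh->do("CREATE TABLE $_ (rec TEXT NOT NULL)") for qw(f1 f2);

my %rows = (
    f1 => [
        q{Vinoth,12,2548.245,"140,North Street,India"},
        q{Vivek,40,2548.245,"140,North Street,India"},
        q{Karthick,10,10.245,"140,North Street,India"},
    ],
    f2 => [
        q{Vinoth,12,2548.245,"140,North Street,USA"},
        q{Karthick,10,10.245,"140,North Street,India"},
        q{Vivek,40,2548.245,"140,North Street,India"},
    ],
);

for my $table ( sort keys %rows ) {
    my $sth = $dbh->prepare("INSERT INTO $table (rec) VALUES (?)");
    $dbh->begin_work;    # one transaction per load keeps inserts fast
    $sth->execute($_) for @{ $rows{$table} };
    $dbh->commit;
}

# Records present in one table but not the other, regardless of row order.
my $diff = $dbh->selectall_arrayref(<<'SQL');
SELECT 'F1' AS src, rec FROM f1 WHERE rec NOT IN (SELECT rec FROM f2)
UNION ALL
SELECT 'F2' AS src, rec FROM f2 WHERE rec NOT IN (SELECT rec FROM f1)
ORDER BY src
SQL

print "$_->[0], $_->[1]\n" for @$diff;
$dbh->disconnect;
```

For files of about 4 GB, an index on the rec column (or on a key column, if the records are split into fields) would probably be needed to keep the NOT IN lookups fast.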

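The following code demonstrates how to produce the desired output without any special modules, but it does not account for potentially messy input data. The script will report records as different even if the only difference is an extra space.

Output goes to the console window by default, unless you specify the output option.

Note: the whole of file #1 is read into memory, so please be patient; processing large files may take a while.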

use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;
use Data::Dumper;    # needed for the --debug dump below

my %opt;
my @args = (
            'file1|f1=s',
            'file2|f2=s',
            'output|o=s',
            'debug|d',
            'help|?',
            'man|m'
        );

GetOptions( \%opt, @args ) or pod2usage(2);

print Dumper(\%opt) if $opt{debug};

pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};

pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};

unlink $opt{output} if defined $opt{output} and -f $opt{output};

compare($opt{file1}, $opt{file2});

sub compare {
    my $fname1 = shift;
    my $fname2 = shift;

    # Hash of file #1: first field => rest of the line
    my $hfile1 = file2hash($fname1);

    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        my($key,$data) = ($1, $2);
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }

    close $fh;
}

# Print a message to the console, or append it to the --output file
sub say_msg {
    my $msg = shift;

    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't open $opt{output}";

        say $fh $msg;

        close $fh;
    } else {
        say $msg;
    }
}

# Read a whole file into a hash: first field => rest of the line
sub file2hash {
    my $fname = shift;
    my %hash;

    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname";

    while(<$fh>) {
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;
    }

    close $fh;

    return \%hash;
}

=head1 NAME

comp_cvs - compares two CSV files and stores the differences

=head1 SYNOPSIS

 comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt

    -f1,--file1 input CSV filename #1
    -f2,--file2 input CSV filename #2
    -o,--output output filename
    -d,--debug  output debug information
    -?,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CSV filename #1

=item B<-f2,--file2>

Input CSV filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Prints debug information.

=item B<-?,--help>

Prints a brief help message and exits.

=item B<-m,--man>

Prints the manual page and exits.

=back

=head1 FILES

=head1 AUTHOR

=head1 SEE ALSO

=head1 HISTORY

=cut

Output:

file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"

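The script above reports two records as different even when they differ only by an extra space. One possible way to soften that, sketched here with the core module Text::ParseWords, is to normalize every record before storing or comparing it. The helper name normalize_record and the choice of "\x1F" as the internal field separator are assumptions made for this example. Note that parse_line also treats single quotes and backslashes as special, so for real-world CSV a dedicated parser such as Text::CSV_XS is likely the safer choice.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: split one CSV line into its key (first field) and a
# whitespace-normalized form of the remaining fields.
sub normalize_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # drop the line ending
    my @fields = parse_line( ',', 0, $line );    # 0 = remove the quotes
    for (@fields) {
        $_ = '' unless defined;
        s/^\s+|\s+$//g;                          # trim surrounding whitespace
        s/\s+/ /g;                               # collapse inner whitespace runs
    }
    my $key = shift @fields;
    # Join with the ASCII unit separator so that commas inside a field
    # cannot be confused with field boundaries.
    return ( $key, join( "\x1F", @fields ) );
}

my ( $k1, $d1 ) = normalize_record(qq{Vinoth,12,2548.245,"140,North Street,India"\n});
my ( $k2, $d2 ) = normalize_record(qq{Vinoth, 12 ,2548.245,"140,North  Street,India" \n});

print $d1 eq $d2 ? "match\n" : "differ\n";    # prints "match"
```

In the script above, the same idea would mean storing the normalized data in file2hash and comparing against the normalized data in compare, at the cost of parsing every line.
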
#!/usr/bin/env perl

use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;

my %POS;
my %chars;

open my $FILEA,'<',q{FileA.txt}
    or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
    or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
    or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
    or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
    or die "Can't open 'OnlyInB.txt' for writing! $!";
    $POS{FILEA}=tell $FILEA;
    $POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';

{ # Scan for first character of the records involved
    while (<$FILEA>) {
        $chars{substr($_,0,1)}++;
    }
    while (<$FILEB>) {
        $chars{substr($_,0,1)}++;
    }
    # So what characters do we need to deal with?
    warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
}
my @chars=sort keys %chars;
{
    my %_h;
    # For each of the characters in our character set
    for my $char (@chars) {
        warn Data::Dumper->Dump([$char],[qw(*char)]),' ';
        # Beginning of data sections
        seek $FILEA,$POS{FILEA},0;
        seek $FILEB,$POS{FILEB},0;
        my $pos=tell $FILEA;
        while (<$FILEA>) {
            # for each record save the lengthAndMD5 as the key and its start as the value
            $_h{lengthAndMD5(\$_)}=$pos
                if (substr($_,0,1) eq $char);
            # always advance, so skipped records do not leave a stale offset
            $pos=tell $FILEA;
        }
        my $_s;
        while (<$FILEB>) {
            next
                unless (substr($_,0,1) eq $char);
            if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
                print {$InBoth} $_;
                delete $_h{$_s};
            }
            else { # (Not in FILEA) It's only in FILEB
                print {$OnlyInB} $_;
            }
        }
        # only in FILEA
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        for my $key (keys %_h) { # Only in FILEA
            seek $FILEA,delete $_h{$key},0;
            print {$OnlyInA} scalar <$FILEA>;
        }
        # Should be empty
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
    }
}

close $OnlyInB
    or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
    or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
    or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
    or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
    or die "Could NOT close 'FileA.txt' after reading! $!";

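# Build a compact key from a record: its length plus its MD5 digest.
# Takes a reference to the record string.
sub lengthAndMD5 {
    return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
}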
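
The script above keys every record by its length and MD5 digest instead of keeping the record text in memory. A minimal sketch of the same digest idea, reduced to the F1/F2 report asked for in the question, could look like this. The helper name diff_by_digest and the in-memory file handles are assumptions made for the example; it also relies on the digest alone, so a hash collision (however unlikely) would hide a difference.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: return the records found in only one of two inputs,
# ignoring record order. Memory holds one 16-byte digest per distinct record
# of A, plus the differences themselves.
sub diff_by_digest {
    my ( $fh_a, $fh_b ) = @_;
    my ( %count, @only_a, @only_b );

    # Pass 1 over A: count how often each record occurs.
    while ( my $rec = <$fh_a> ) {
        chomp $rec;
        $count{ md5($rec) }++;
    }

    # Pass over B: cancel matching records, keep the rest.
    while ( my $rec = <$fh_b> ) {
        chomp $rec;
        my $digest = md5($rec);
        if ( $count{$digest} ) { $count{$digest}-- }
        else                   { push @only_b, $rec }
    }

    # Pass 2 over A: whatever was not cancelled exists only in A.
    seek $fh_a, 0, 0 or die "Cannot rewind input A: $!";
    while ( my $rec = <$fh_a> ) {
        chomp $rec;
        my $digest = md5($rec);
        next unless $count{$digest};
        $count{$digest}--;
        push @only_a, $rec;
    }
    return ( \@only_a, \@only_b );
}

# Sample data from the question, held in memory for the demonstration.
my $file_a = join '', map { "$_\n" } (
    q{Vinoth,12,2548.245,"140,North Street,India"},
    q{Vivek,40,2548.245,"140,North Street,India"},
    q{Karthick,10,10.245,"140,North Street,India"},
);
my $file_b = join '', map { "$_\n" } (
    q{Vinoth,12,2548.245,"140,North Street,USA"},
    q{Karthick,10,10.245,"140,North Street,India"},
    q{Vivek,40,2548.245,"140,North Street,India"},
);
open my $fh_a, '<', \$file_a or die "Cannot open in-memory file A: $!";
open my $fh_b, '<', \$file_b or die "Cannot open in-memory file B: $!";

my ( $only_a, $only_b ) = diff_by_digest( $fh_a, $fh_b );
print "F1, $_\n" for @$only_a;
print "F2, $_\n" for @$only_b;
```

With real files the handles would be opened on the two CSV files as raw bytes, since md5 rejects strings containing wide characters.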
