比较 2 个 CSV 巨大的 CSV 文件并使用 perl 将差异打印到另一个 csv 文件
Compare 2 CSV Huge CSV Files and print the differences to another csv file using perl
我有 2 个包含多个字段(大约 30 个字段)的 csv 文件,而且大小很大(大约 4GB)。
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
我想比较这 2 个文件并将差异报告到另一个 csv 文件中。在上面的示例中,Employee Vivek 和 Karthick 详细信息出现在不同的行号中,但记录数据仍然相同,因此应视为匹配。员工 Vinoth 记录应被视为不匹配,因为地址不匹配。
输出 diff.csv 文件可以包含文件 1 和文件 2 中的不匹配记录,如下所示。
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
到目前为止,我已经编写了代码。在此之后我很困惑选择哪个选项是二进制搜索还是任何其他有效的方法来执行此操作。你能帮帮我吗?
My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );
# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
or die("Error in command line arguments\n");
if ( !defined $opts{'file1'} ) {
die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
die "CSV file --file2 not specified.\n";
}
my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';
print $file2 . "\n";
my $csv1 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
open( my $fh1, '<:encoding(utf8)', $file1 )
or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );
my $f1line = undef;
my $f2line = undef;
my $header1 = undef;
my $f1empty = 'false';
my $f2empty = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
while (1) {
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
if ( !defined $f1line ) {
$f1empty = 'true';
}
if ( $f1empty eq 'true' ) {
last;
}
else {
## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
}
}
print "End of Program" . "\n";
在数据库中存储如此 数量级 的数据是处理此类任务的最正确方法。至少 SQLlite is recommended but other databases MariaDB, MySQL, PostgreSQL 会很好用。
以下代码演示了如何在没有特殊模块的情况下实现所需的输出,但它没有考虑可能 混乱的输入数据 。此脚本会将数据记录报告为不同,即使差异可能只是一个额外的 space.
默认输出到控制台 window,除非您指定选项 output。
注意:整个文件 #1 已读入内存,请耐心等待处理大文件可能需要一段时间。
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
my %opt;
my @args = (
'file1|f1=s',
'file2|f2=s',
'output|o=s',
'debug|d',
'help|?',
'man|m'
);
GetOptions( \%opt, @args ) or pod2usage(2);
print Dumper(\%opt) if $opt{debug};
pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};
pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};
unlink $opt{output} if defined $opt{output} and -f $opt{output};
compare($opt{file1},$opt{file2});
sub compare {
my $fname1 = shift;
my $fname2 = shift;
my $hfile1 = file2hash($fname1);
open my $fh, '<:encoding(utf8)', $fname2
or die "Couldn't open $fname2";
while(<$fh>) {
chomp;
next unless /^(.*?),(.*)$/;
my($key,$data) = (, );
if( !defined $hfile1->{$key} ) {
my $msg = "$fname1 $key is missing";
say_msg($msg);
} elsif( $data ne $hfile1->{$key} ) {
my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
say_msg($msg);
}
}
}
sub say_msg {
my $msg = shift;
if( $opt{output} ) {
open my $fh, '>>:encoding(utf8)', $opt{output}
or die "Couldn't to open $opt{output}";
say $fh $msg;
close $fh;
} else {
say $msg;
}
}
sub file2hash {
my $fname = shift;
my %hash;
open my $fh, '<:encoding(utf8)', $fname
or die "Couldn't open $fname";
while(<$fh>) {
chomp;
next unless /^(.*?),(.*)$/;
$hash{} = ;
}
close $fh;
return \%hash;
}
__END__
=head1 NAME
comp_cvs - compares two CVS files and stores differense
=head1 SYNOPSIS
comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt
Options:
-f1,--file1 input CVS filename #1
-f2,--file2 input CVS filename #2
-o,--output output filename
-d,--debug output debug information
-?,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-f1,--file1>
Input CVS filename #1
=item B<-f2,--file2>
Input CVS filename #2
=item B<-o,--output>
Output filename
=item B<-d,--debug>
Print debug information.
=item B<-?,--help>
Print a brief help message and exits.
=item B<--man>
Prints the manual page and exits.
=back
=head1 DESCRIPTION
B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal.
=head1 EXIT STATUS
The section describes B<EXIT STATUS> codes of the program
=head1 ENVIRONMENT
The section describes B<ENVIRONMENT VARIABLES> utilized in the program
=head1 FILES
The section describes B<FILES> which used for program's configuration
=head1 EXAMPLES
The section demonstrates some B<EXAMPLES> of the code
=head1 REPORTING BUGS
The section provides information how to report bugs
=head1 AUTHOR
The section describing author and his contanct information
=head1 ACKNOWLEDGMENT
The section to give credits people in some way related to the code
=head1 SEE ALSO
The section describing related information - reference to other programs, blogs, website, ...
=head1 HISTORY
The section gives historical information related to the code of the program
=head1 COPYRIGHT
Copyright information related to the code
=cut
测试文件的输出
file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"
#!/usr/bin/env perl
use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;
my %POS;
my %chars;
open my $FILEA,'<',q{FileA.txt}
or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
or die "Can't open 'OnlyInB.txt' for writing! $!";
<$FILEA>,
$POS{FILEA}=tell $FILEA;
<$FILEB>,
$POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';
{ # Scan for first character of the records involved
while (<$FILEA>) {
$chars{substr($_,0,1)}++;
};
while (<$FILEB>) {
$chars{substr($_,0,1)}--;
};
# So what characters do we need to deal with?
warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
};
my @chars=sort keys %chars;
{
my %_h;
# For each of the characters in our character set
for my $char (@chars) {
warn Data::Dumper->Dump([$char],[qw(*char)]),' ';
# Beginning of data sections
seek $FILEA,$POS{FILEA},0;
seek $FILEB,$POS{FILEB},0;
%_h=();
my $pos=tell $FILEA;
while (<$FILEA>) {
next
unless (substr($_,0,1) eq $char);
# for each record save the lengthAndMD5 as the key and its start as the value
$_h{lengthAndMD5($_)}=$pos;
$pos=tell $FILEA;
};
my $_s;
while (<$FILEB>) {
next
unless (substr($_,0,1) eq $char);
if (exists $_h{$_s=lengthAndMD5($_)}) { # It's a duplicate
print {$InBoth} $_;
delete $_h{$_s};
}
else { # (Not in FILEA) It's only in FILEB
print {$OnlyInB} $_;
}
};
# only in FILEA
warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
for my $key (keys %_h) { # Only in FILEA
seek $FILEA,delete $_h{$key},0;
print {$OnlyInA} scalar <$FILEA>;
};
# Should be empty
warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
};
};
close $OnlyInB
or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
or die "Could NOT close 'FileA.txt' after reading! $!";
exit;
sub lengthAndMD5 {
return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
};
__END__
我有 2 个包含多个字段(大约 30 个字段)的 csv 文件,而且大小很大(大约 4GB)。
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
我想比较这 2 个文件并将差异报告到另一个 csv 文件中。在上面的示例中,Employee Vivek 和 Karthick 详细信息出现在不同的行号中,但记录数据仍然相同,因此应视为匹配。员工 Vinoth 记录应被视为不匹配,因为地址不匹配。
输出 diff.csv 文件可以包含文件 1 和文件 2 中的不匹配记录,如下所示。
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
到目前为止,我已经编写了代码。在此之后我很困惑选择哪个选项是二进制搜索还是任何其他有效的方法来执行此操作。你能帮帮我吗?
My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );
# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
or die("Error in command line arguments\n");
if ( !defined $opts{'file1'} ) {
die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
die "CSV file --file2 not specified.\n";
}
my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';
print $file2 . "\n";
my $csv1 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
open( my $fh1, '<:encoding(utf8)', $file1 )
or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );
my $f1line = undef;
my $f2line = undef;
my $header1 = undef;
my $f1empty = 'false';
my $f2empty = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
while (1) {
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
if ( !defined $f1line ) {
$f1empty = 'true';
}
if ( $f1empty eq 'true' ) {
last;
}
else {
## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
}
}
print "End of Program" . "\n";
在数据库中存储如此 数量级 的数据是处理此类任务的最正确方法。至少 SQLlite is recommended but other databases MariaDB, MySQL, PostgreSQL 会很好用。
以下代码演示了如何在没有特殊模块的情况下实现所需的输出,但它没有考虑可能 混乱的输入数据 。此脚本会将数据记录报告为不同,即使差异可能只是一个额外的 space.
默认输出到控制台 window,除非您指定选项 output。
注意:整个文件 #1 已读入内存,请耐心等待处理大文件可能需要一段时间。
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
my %opt;
my @args = (
'file1|f1=s',
'file2|f2=s',
'output|o=s',
'debug|d',
'help|?',
'man|m'
);
GetOptions( \%opt, @args ) or pod2usage(2);
print Dumper(\%opt) if $opt{debug};
pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};
pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};
unlink $opt{output} if defined $opt{output} and -f $opt{output};
compare($opt{file1},$opt{file2});
sub compare {
my $fname1 = shift;
my $fname2 = shift;
my $hfile1 = file2hash($fname1);
open my $fh, '<:encoding(utf8)', $fname2
or die "Couldn't open $fname2";
while(<$fh>) {
chomp;
next unless /^(.*?),(.*)$/;
my($key,$data) = (, );
if( !defined $hfile1->{$key} ) {
my $msg = "$fname1 $key is missing";
say_msg($msg);
} elsif( $data ne $hfile1->{$key} ) {
my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
say_msg($msg);
}
}
}
sub say_msg {
my $msg = shift;
if( $opt{output} ) {
open my $fh, '>>:encoding(utf8)', $opt{output}
or die "Couldn't to open $opt{output}";
say $fh $msg;
close $fh;
} else {
say $msg;
}
}
sub file2hash {
my $fname = shift;
my %hash;
open my $fh, '<:encoding(utf8)', $fname
or die "Couldn't open $fname";
while(<$fh>) {
chomp;
next unless /^(.*?),(.*)$/;
$hash{} = ;
}
close $fh;
return \%hash;
}
__END__
=head1 NAME
comp_cvs - compares two CVS files and stores differense
=head1 SYNOPSIS
comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt
Options:
-f1,--file1 input CVS filename #1
-f2,--file2 input CVS filename #2
-o,--output output filename
-d,--debug output debug information
-?,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-f1,--file1>
Input CVS filename #1
=item B<-f2,--file2>
Input CVS filename #2
=item B<-o,--output>
Output filename
=item B<-d,--debug>
Print debug information.
=item B<-?,--help>
Print a brief help message and exits.
=item B<--man>
Prints the manual page and exits.
=back
=head1 DESCRIPTION
B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal.
=head1 EXIT STATUS
The section describes B<EXIT STATUS> codes of the program
=head1 ENVIRONMENT
The section describes B<ENVIRONMENT VARIABLES> utilized in the program
=head1 FILES
The section describes B<FILES> which used for program's configuration
=head1 EXAMPLES
The section demonstrates some B<EXAMPLES> of the code
=head1 REPORTING BUGS
The section provides information how to report bugs
=head1 AUTHOR
The section describing author and his contanct information
=head1 ACKNOWLEDGMENT
The section to give credits people in some way related to the code
=head1 SEE ALSO
The section describing related information - reference to other programs, blogs, website, ...
=head1 HISTORY
The section gives historical information related to the code of the program
=head1 COPYRIGHT
Copyright information related to the code
=cut
测试文件的输出
file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"
#!/usr/bin/env perl
use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;
my %POS;
my %chars;
open my $FILEA,'<',q{FileA.txt}
or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
or die "Can't open 'OnlyInB.txt' for writing! $!";
<$FILEA>,
$POS{FILEA}=tell $FILEA;
<$FILEB>,
$POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';
{ # Scan for first character of the records involved
while (<$FILEA>) {
$chars{substr($_,0,1)}++;
};
while (<$FILEB>) {
$chars{substr($_,0,1)}--;
};
# So what characters do we need to deal with?
warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
};
my @chars=sort keys %chars;
{
my %_h;
# For each of the characters in our character set
for my $char (@chars) {
warn Data::Dumper->Dump([$char],[qw(*char)]),' ';
# Beginning of data sections
seek $FILEA,$POS{FILEA},0;
seek $FILEB,$POS{FILEB},0;
%_h=();
my $pos=tell $FILEA;
while (<$FILEA>) {
next
unless (substr($_,0,1) eq $char);
# for each record save the lengthAndMD5 as the key and its start as the value
$_h{lengthAndMD5($_)}=$pos;
$pos=tell $FILEA;
};
my $_s;
while (<$FILEB>) {
next
unless (substr($_,0,1) eq $char);
if (exists $_h{$_s=lengthAndMD5($_)}) { # It's a duplicate
print {$InBoth} $_;
delete $_h{$_s};
}
else { # (Not in FILEA) It's only in FILEB
print {$OnlyInB} $_;
}
};
# only in FILEA
warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
for my $key (keys %_h) { # Only in FILEA
seek $FILEA,delete $_h{$key},0;
print {$OnlyInA} scalar <$FILEA>;
};
# Should be empty
warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
};
};
close $OnlyInB
or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
or die "Could NOT close 'FileA.txt' after reading! $!";
exit;
sub lengthAndMD5 {
return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
};
__END__