编写高效的 Perl 代码来爬取大型目录
Writing Efficient Perl Code to Crawl Through Large Directories
我正在编写一个 Perl 脚本,它会在包含 300,000 多个文件的目录中进行爬网,并删除给定年份中除第一个文件之外的所有文件。我的问题是我的代码强制 Perl 基本上扫描 300,000 个文件的目录,估计大约 300,001 次。到目前为止,已经 运行 四天了,我希望你们有一些技巧,可以在将来使这样的代码更有效率。
脚本:
#!/usr/bin/perl
use Date::Calc qw(Delta_Days Decode_Date_EU);
# Note: must use default perl moudle on Killdevil (module add perl)
@base = (1993, 1, 1);
$count = 0;
@files = <*>; # Creates array of all files in directory
foreach $file (@files) {
# Splits indivdual filename into an array seperated by
# comma (CIK, 10, K, Year, Month, Date) indexed by 0-5
@filearray = split(/\-/, $file);
$cik = $filearray[0];
$cikyear = $filearray[3];
# Defines a new array as all files in directory with the
# same CIK and year as our file
@cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;
$sizecik = @cikfiles;
$best = 0; # Index for file with earliest date
$bestsize = 1000000000000000000000000000; # Initial value to beat
# Only run through the proccess if there are
# multiple files with same CIK same year.
if ($sizecik != 1) {
for($i = 1; $i < $sizecik + 1; $i = $i + 1) {
# Read filename and creates an array deliminated by "-"
@filearray1 = split(/-/, $cikfiles[$i-1]);
$year = $filearray1[3];
$month = $filearray1[4];
# Deletes leading zero from months if there exists one
$month =~ s/^0//;
$day = $filearray1[5];
$day =~ s/^0//; # Removes leading zero
# Calculates number of days from base year
$dd = Delta_Days($base[0], $base[1], $base[2], $year, $month, $day);
if ($dd < $bestsize) {
# If has lower number of days than current best, index
# this file as the new leader
$best = $i;
# Reset the size to beat to the dd of this file
$bestsize = $dd;
}
}
for ($i = 1; $i < $sizecik + 1; $i = $i + 1) {
# Runs through current array and deletes all
# files that are not the best
if($i != $best) {
$rm = "rm " . $cikfiles[$i-1];
system($rm);
$count = $count + 1;
}
}
}
}
# Displays total number of files removed
print "Number of files deleted: $count";
close(MYOUTFILE);
如果不通过目录查找会不会更有效率
@cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;
我反而搜索了原始数组,然后删除了条目?
@cikfiles = grep { /^$cik-10-K-$cikyear/ } <@files>;
如何删除我从@files 数组中删除的元素?
无需多次扫描目录。扫描一次目录,收集您需要的信息。
如果日期格式为 YYYYMMDD
,可以使用简单的字符串比较来确定两个日期中哪个日期较早。
my $opt_dry_run = 1;
my %files_by_cik_and_year;
while (<*>) {
my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
push @{ $files_by_cik_and_year{$cik}{$year} },
[ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}
for my $cik (keys(%files_by_cik_and_year)) {
for my $year (keys(%{ $files_by_cik_and_year{$cik} })) {
my @files =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
@{ $files_by_cik_and_year{$cik}{$year} };
shift(@files);
for (@files) {
print("Deleting $_\n");
if (!$opt_dry_run) {
unlink($_)
or warn("Couldn't delete $_\n");
}
}
}
}
简化版:
my $opt_dry_run = 1;
my %files_by_cik_and_year;
while (<*>) {
my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
push @{ $files_by_cik_and_year{"$cik-$year"} },
[ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}
for (values(%files_by_cik_and_year)) {
my @files =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
@$_;
shift(@files);
for (@files) {
print("Deleting $_\n");
if (!$opt_dry_run) {
unlink($_)
or warn("Couldn't delete $_\n");
}
}
}
我正在编写一个 Perl 脚本,它会在包含 300,000 多个文件的目录中进行爬网,并删除给定年份中除第一个文件之外的所有文件。我的问题是我的代码强制 Perl 基本上扫描 300,000 个文件的目录,估计大约 300,001 次。到目前为止,已经 运行 四天了,我希望你们有一些技巧,可以在将来使这样的代码更有效率。
脚本:
#!/usr/bin/perl
use Date::Calc qw(Delta_Days Decode_Date_EU);
# Note: must use default perl moudle on Killdevil (module add perl)
@base = (1993, 1, 1);
$count = 0;
@files = <*>; # Creates array of all files in directory
foreach $file (@files) {
# Splits indivdual filename into an array seperated by
# comma (CIK, 10, K, Year, Month, Date) indexed by 0-5
@filearray = split(/\-/, $file);
$cik = $filearray[0];
$cikyear = $filearray[3];
# Defines a new array as all files in directory with the
# same CIK and year as our file
@cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;
$sizecik = @cikfiles;
$best = 0; # Index for file with earliest date
$bestsize = 1000000000000000000000000000; # Initial value to beat
# Only run through the proccess if there are
# multiple files with same CIK same year.
if ($sizecik != 1) {
for($i = 1; $i < $sizecik + 1; $i = $i + 1) {
# Read filename and creates an array deliminated by "-"
@filearray1 = split(/-/, $cikfiles[$i-1]);
$year = $filearray1[3];
$month = $filearray1[4];
# Deletes leading zero from months if there exists one
$month =~ s/^0//;
$day = $filearray1[5];
$day =~ s/^0//; # Removes leading zero
# Calculates number of days from base year
$dd = Delta_Days($base[0], $base[1], $base[2], $year, $month, $day);
if ($dd < $bestsize) {
# If has lower number of days than current best, index
# this file as the new leader
$best = $i;
# Reset the size to beat to the dd of this file
$bestsize = $dd;
}
}
for ($i = 1; $i < $sizecik + 1; $i = $i + 1) {
# Runs through current array and deletes all
# files that are not the best
if($i != $best) {
$rm = "rm " . $cikfiles[$i-1];
system($rm);
$count = $count + 1;
}
}
}
}
# Displays total number of files removed
print "Number of files deleted: $count";
close(MYOUTFILE);
如果不通过目录查找会不会更有效率
@cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;
我反而搜索了原始数组,然后删除了条目?
@cikfiles = grep { /^$cik-10-K-$cikyear/ } <@files>;
如何删除我从@files 数组中删除的元素?
无需多次扫描目录。扫描一次目录,收集您需要的信息。
如果日期格式为 YYYYMMDD
,可以使用简单的字符串比较来确定两个日期中哪个日期较早。
my $opt_dry_run = 1;
my %files_by_cik_and_year;
while (<*>) {
my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
push @{ $files_by_cik_and_year{$cik}{$year} },
[ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}
for my $cik (keys(%files_by_cik_and_year)) {
for my $year (keys(%{ $files_by_cik_and_year{$cik} })) {
my @files =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
@{ $files_by_cik_and_year{$cik}{$year} };
shift(@files);
for (@files) {
print("Deleting $_\n");
if (!$opt_dry_run) {
unlink($_)
or warn("Couldn't delete $_\n");
}
}
}
}
简化版:
my $opt_dry_run = 1;
my %files_by_cik_and_year;
while (<*>) {
my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
push @{ $files_by_cik_and_year{"$cik-$year"} },
[ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}
for (values(%files_by_cik_and_year)) {
my @files =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
@$_;
shift(@files);
for (@files) {
print("Deleting $_\n");
if (!$opt_dry_run) {
unlink($_)
or warn("Couldn't delete $_\n");
}
}
}