编写高效的 Perl 代码来爬取大型目录

Writing Efficient Perl Code to Crawl Through Large Directories

我正在编写一个 Perl 脚本,它会在包含 300,000 多个文件的目录中进行爬网,并删除给定年份中除第一个文件之外的所有文件。我的问题是我的代码强制 Perl 基本上扫描 300,000 个文件的目录,估计大约 300,001 次。到目前为止,已经 运行 四天了,我希望你们有一些技巧,可以在将来使这样的代码更有效率。

脚本:

#!/usr/bin/perl
use Date::Calc qw(Delta_Days Decode_Date_EU);
# Note: must use default perl moudle on Killdevil (module add perl)

@base = (1993, 1, 1);
$count = 0;

@files = <*>; # Creates array of all files in directory
foreach $file (@files) {
    # Splits indivdual filename into an array seperated by
    # comma (CIK, 10, K, Year, Month, Date) indexed by 0-5
    @filearray = split(/\-/, $file);

    $cik = $filearray[0];
    $cikyear = $filearray[3];

    # Defines a new array as all files in directory with the
    # same CIK and year as our file
    @cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;

    $sizecik = @cikfiles;
    $best = 0; # Index for file with earliest date
    $bestsize = 1000000000000000000000000000; # Initial value to beat

    # Only run through the proccess if there are
    # multiple files with same CIK same year.
    if ($sizecik != 1) {

        for($i = 1; $i < $sizecik + 1; $i = $i + 1) {
            # Read filename and creates an array deliminated by "-"
            @filearray1 = split(/-/, $cikfiles[$i-1]);

            $year = $filearray1[3];
            $month = $filearray1[4];

            # Deletes leading zero from months if there exists one
            $month =~ s/^0//;
            $day = $filearray1[5];
            $day =~ s/^0//; # Removes leading zero

            # Calculates number of days from base year
            $dd = Delta_Days($base[0], $base[1], $base[2], $year, $month, $day);

            if ($dd < $bestsize) {
                # If has lower number of days than current best, index
                # this file as the new leader
                $best = $i;

                # Reset the size to beat to the dd of this file
                $bestsize = $dd;
            }
        }

        for ($i = 1; $i < $sizecik + 1; $i = $i + 1) {
            # Runs through current array and deletes all
            # files that are not the best
            if($i != $best) {
                $rm = "rm " . $cikfiles[$i-1];
                system($rm);
                $count = $count + 1;
            }
        }
    }
}

# Displays total number of files removed
print "Number of files deleted: $count";

close(MYOUTFILE);

如果不通过目录查找会不会更有效率

@cikfiles = grep { /^$cik-10-K-$cikyear/ } <*>;

我反而搜索了原始数组,然后删除了条目?

@cikfiles = grep { /^$cik-10-K-$cikyear/ } <@files>;

如何删除我从@files 数组中删除的元素?

无需多次扫描目录。扫描一次目录,收集您需要的信息。

如果日期格式为 YYYYMMDD,可以使用简单的字符串比较来确定两个日期中哪个日期较早。

my $opt_dry_run = 1;

my %files_by_cik_and_year;
while (<*>) {
   my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
   push @{ $files_by_cik_and_year{$cik}{$year} },
      [ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}

for my $cik (keys(%files_by_cik_and_year)) {
   for my $year (keys(%{ $files_by_cik_and_year{$cik} })) {
      my @files =
         map { $_->[0] }
            sort { $a->[1] cmp $b->[1] }
               @{ $files_by_cik_and_year{$cik}{$year} };

      shift(@files);

      for (@files) {
         print("Deleting $_\n");
         if (!$opt_dry_run) {
            unlink($_)
               or warn("Couldn't delete $_\n");
         }
      }
   }
}

简化版:

my $opt_dry_run = 1;

my %files_by_cik_and_year;
while (<*>) {
   my ($cik, undef, undef, $year, $month, $day) = split(/-/, $_);
   push @{ $files_by_cik_and_year{"$cik-$year"} },
      [ $_, sprintf("%04d%02d%02d", $year, $month, $day) ];
}

for (values(%files_by_cik_and_year)) {
   my @files =
      map { $_->[0] }
         sort { $a->[1] cmp $b->[1] }
            @$_;

   shift(@files);

   for (@files) {
      print("Deleting $_\n");
      if (!$opt_dry_run) {
         unlink($_)
            or warn("Couldn't delete $_\n");
      }
   }
}