如何使用 Perl 将完整的树结构添加到 .tar.bz2 文件中?

How to add complete tree structure into a .tar.bz2 file with Perl?

我希望将分布在大量子目录中的大量 数据压缩到存档中。我不能简单地使用内置 tar 函数,因为我需要我的 Perl 脚本在 Windows 和 Linux 环境中工作。我找到了 Archive::Tar 模块,但是 their documentation 给出了警告:

Note that this method [create_archive()] does not write on the fly as it were; it still reads all the files into memory before writing out the archive. Consult the FAQ below if this is a problem.

由于我的数据量很大,我想写'on the fly'。但是我无法在常见问题解答中找到有关 编写 文件的有用信息。他们建议使用迭代器 iter():

Returns an iterator function that reads the tar file without loading it all in memory. Each time the function is called it will return the next file in the tarball.

my $next = Archive::Tar->iter( "example.tar.gz", 1, {filter => qr/\.pm$/} );
while( my $f = $next->() ) {
    print $f->name, "\n";
    $f->extract or warn "Extraction failed";
    # ....
}

但这只讨论文件的,而不是压缩包的。所以我的问题是,如何以内存友好的方式获取目录 $dir 并使用 bzip2 压缩将其递归添加到存档 archive.tar.bz2 中,即不首先将整个树加载到内存中?

我尝试使用 Archive::Tar::StreamedIO::Compress::Bzip2 根据评论中的建议构建自己的脚本,但无济于事。

use strict;
use warnings;

use Archive::Tar::Streamed;
use File::Spec qw(catfile);
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

my ($in_d, $out_tar, $out_bz2) = @ARGV;

open(my $out_fh,'>', $out_tar) or die "Couldn't create archive";
binmode $out_fh;

my $tar = Archive::Tar::Streamed->new($out_fh);

opendir(my $in_dh, $in_d) or die "Could not opendir '$in_d': $!";
while (my $in_f = readdir $in_dh) {
  next unless ($in_f =~ /\.xml$/);
  print STDOUT "Processing $in_f\r";
  $in_f = File::Spec->catfile($in_d, $in_f);
  $tar->add($in_f);
}

print STDOUT "\nBzip'ing $out_tar\r";

 bzip2 $out_tar => $out_bz2
    or die "Bzip2 failed: $Bzip2Error\n";

很快,我的系统内存不足。我当前的系统中有 32GB 可用空间,但它几乎立即被淹没了。我尝试添加到存档的目录中的某些文件超过 32GB。

所以我想知道是否即使在 Streamed class 中每个文件都必须在添加到存档之前完全在内存中读取?我假设文件本身会在缓冲区中流式传输到存档,但也许只是不是首先将所有文件保存在内存中,Streamed 允许完全只需要内存中的一个文件,然后将其添加到存档中, 一个接一个?

不幸的是,你想要的 is not possible 在 Perl 中:

I agree, it would be nice if this module could write the files in chunks and then rewrite the headers afterwards (to maintain the relationship of Archive::Tar doing the writing). You could maybe walk the archive backwards knowing you split the file into N entries, remove the extra headers, and update the first header with the sum of their sizes.

At the moment the only options are: use Archive::Tar::File, split the data into manageable sizes outside of perl, or use the tar command directly (to use it from perl, there's a nice wrapper on CPAN: Archive::Tar::Wrapper).

I don't think we'll ever have a truly non-memory-resident tar implementation in Perl based on Archive::Tar. To be honest, Archive::Tar itself needs to be rewritten or succeeded by something else.

这是我的解决方案的原始版本,它仍然在内存中存储整个文件。我今天可能没有时间添加只存储部分文件的更新,因为 Archive::Tar 模块没有最友好的 API

use strict;
use warnings 'all';
use autodie; # Remove need for checks on IO calls

use File::Find 'find';
use Archive::Tar::Streamed ();
use Compress::Raw::Bzip2;
use Time::HiRes qw/ gettimeofday tv_interval /;

# Set a default root directory for testing
#
BEGIN {
    our @ARGV;
    @ARGV = 'E:\test' unless @ARGV;
}

use constant ROOT_DIR => shift;

use constant KB => 1024;
use constant MB => KB * KB;
use constant GB => MB * KB;

STDOUT->autoflush; # Make sure console output isn't buffered

my $t0 = [ gettimeofday ];

# Create a pipe, and fork a child that will build a tar archive
# from the files and pass the result to the pipe as it is built
#
# The parent reads from the pipe and passes each chunk to the
# module for compression. The result of zipping each block is
# written directly to the bzip2 file
#
pipe( my $pipe_from_tar, my $pipe_to_parent );  # Make our pipe
my $pid  = fork;                      # fork the process

if ( $pid == 0 ) {    # child builds tar and writes it to the pipe

    $pipe_from_tar->close;    # Close the parent side of the pipe
    $pipe_to_parent->binmode;
    $pipe_to_parent->autoflush; 

    # Create the ATS object, specifiying that the tarred output
    # will be passed straight to the pipe
    #
    my $tar = Archive::Tar::Streamed->new( $pipe_to_parent );

    find(sub {

        my $file = File::Spec->canonpath( $File::Find::name );
        $tar->add( $file );

        print "Processing $file\n" if -d;

    }, ROOT_DIR );

    $tar->writeeof; # This is undocumented but essential

    $pipe_to_parent->close;
}
else {    # parent reads the tarred data, bzips it, and writes it to the file

    $pipe_to_parent->close; # Close the child side of the pipe
    $pipe_from_tar->binmode;

    open my $bz2_fh, '>:raw', 'T:\test.tar.bz2';
    $bz2_fh->autoflush;

    # The first parameter *must* have a value of zero. The default
    # is to accumulate each zipped chunnk into the output variable,
    # whereas we want to write each chunk to a file
    #
    my ( $bz, $status ) = Compress::Raw::Bzip2->new( 0 );
    defined $bz or die "Cannot create bunzip2 object: $status\n";

    my $zipped;

    while ( my $len = read $pipe_from_tar, my $buff, 8 * MB ) {

        $status = $bz->bzdeflate( $buff, $zipped );
        $bz2_fh->print( $zipped ) if length $zipped;
    }

    $pipe_from_tar->close;

    $status = $bz->bzclose( $zipped );
    $bz2_fh->print( $zipped ) if length $zipped;

    $bz2_fh->close;

    my $elapsed = tv_interval( $t0 );

    printf "\nProcessing took %s\n", hms($elapsed);
}


use constant MINUTE => 60;
use constant HOUR   => MINUTE * 60;

sub hms {
    my ($s) = @_;

    my @ret;

    if ( $s > HOUR ) {
        my $h = int($s / HOUR);
        $s -= $h * HOUR;
        push @ret, "${h}h";
    }

    if ( $s > MINUTE or @ret ) {
        my $m = int($s / MINUTE);
        $s -= $m * MINUTE;
        push @ret, "${m}m";
    }

    push @ret, sprintf "%.1fs", $s;

    "@ret";
}