How to split files up and process them in parallel and then stitch them back?

Question

I have a text file, infile.txt, that looks like this:

abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?

Each line in the file is processed by this perl command into out.txt:

`cat infile.txt | perl dosomething > out.txt`

Imagine the text file has 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:

$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n l/3 ../infile.txt
$ for i in *; do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt

But is there a less verbose way to do the same thing?

Answer 1

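I've never tried it myself, but GNU parallel might be worth a look.

Here is an excerpt from the man page (parallel(1)) that is similar to what you're currently doing. It can split the input in other ways too.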

EXAMPLE: Processing a big file using more cores
       To process a big file or some output you can use --pipe to split up
       the data into blocks and pipe the blocks into the processing program.

       If the program is gzip -9 you can do:

       cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

       This will split bigfile into blocks of 1 MB and pass that to gzip -9
       in parallel. One gzip will be run per CPU core. The output of gzip -9
       will be kept in order and saved to bigfile.gz

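Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you will spend most of the time shuffling data to and from the disk, and parallelizing won't gain you much.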
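
Applied to the pipeline from the question, the --pipe approach in the excerpt might look like the sketch below. It assumes GNU parallel is installed (it falls back to a serial run otherwise), uses `tr a-z A-Z` as a stand-in for `perl dosomething`, and works on made-up sample data in a temporary directory. The `--recend ''` from the gzip example is left out: the default record separator is a newline, which keeps each line intact.

```shell
#!/bin/sh
# Sketch: run a line-oriented filter over a file in parallel, keeping order.
# Assumption: `tr a-z A-Z` stands in for `perl dosomething`.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample input.
printf '%s\n' 'abc' 'foobar hello' 'xyz' 'kmeans' > "$workdir/infile.txt"

if parallel --version 2>/dev/null </dev/null | grep -q 'GNU parallel'; then
    # --pipe splits stdin into blocks on line boundaries and feeds each block
    # to one job; -k emits the job outputs in input order.
    parallel --pipe -k tr a-z A-Z < "$workdir/infile.txt" > "$workdir/out.txt"
else
    echo 'GNU parallel not found; running serially instead' >&2
    tr a-z A-Z < "$workdir/infile.txt" > "$workdir/out.txt"
fi

result=$(cat "$workdir/out.txt")
printf '%s\n' "$result"
```

With a file this small everything fits in a single block, so only one job runs; the split only pays off on inputs much larger than the block size.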

You can find some introductory videos by the author of GNU Parallel here.

Answer 2

Assuming your limiting factor is NOT your disk, you can do this in perl with fork(), and specifically with Parallel::ForkManager:

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my $max_forks = 8; #2x procs is usually optimal

sub process_line {
    #do something with this line
}

my $fork_manager = Parallel::ForkManager -> new ( $max_forks ); 

open ( my $input, '<', 'infile.txt' ) or die $!;
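while ( my $line = <$input> ) {
    # start() forks: it returns the child's PID in the parent (which moves on
    # to the next line) and 0 in the child (which falls through).
    $fork_manager -> start and next;
    process_line ( $line );
    # finish() ends the child process.
    $fork_manager -> finish;
}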

close ( $input );
$fork_manager -> wait_all_children();

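The downside of doing this is merging the output. Parallel tasks don't necessarily finish in the order they started, so there are all sorts of potential problems with serializing the results.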
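
One way to sidestep the ordering problem is the approach from the question: give each worker its own output file, then concatenate the files in chunk order once every worker has finished. A minimal shell sketch, assuming `tr a-z A-Z` as a stand-in for the real per-line processing and made-up sample data:

```shell
#!/bin/sh
# Sketch: each worker writes to its own file; files are merged in chunk order.
# Assumption: `tr a-z A-Z` stands in for the real per-line processing.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/in" "$workdir/out"

# Hypothetical sample input.
printf '%s\n' one two three four five six > "$workdir/infile.txt"

# split -l keeps lines whole; the chunk names (chunk_aa, chunk_ab, ...)
# sort in input order.
split -l 2 "$workdir/infile.txt" "$workdir/in/chunk_"

for chunk in "$workdir"/in/chunk_*; do
    tr a-z A-Z < "$chunk" > "$workdir/out/$(basename "$chunk")" &
done
wait    # block until every background worker has exited

# The glob expands in sorted order, so the merge follows the input order
# no matter which worker finished first.
merged=$(cat "$workdir"/out/chunk_*)
printf '%s\n' "$merged"
```

The cost is one extra pass over the data for the final concatenation, which matters if the job is disk-bound.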

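You can work around these with something like flock, but you need to be careful: too many locking operations can take away your parallel advantage in the first place. (Hence my first statement: if your limiting factor is disk IO, then parallelism doesn't help very much anyway.)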
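
The lock-based route can be sketched in shell with the flock(1) utility from util-linux, which is analogous to (but not the same thing as) Perl's flock function. Workers append to one shared file while holding an exclusive lock, so writes never interleave, but the chunks land in whatever order the workers finished. Here `tr a-z A-Z` and the sample data are stand-ins, and the availability of flock(1) is an assumption.

```shell
#!/bin/sh
# Sketch: workers append to one shared file under an exclusive lock.
# Assumptions: flock(1) from util-linux is available; `tr a-z A-Z` stands in
# for the real processing.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/in"

# Hypothetical sample input, split into 2-line chunks.
printf '%s\n' one two three four five six > "$workdir/infile.txt"
split -l 2 "$workdir/infile.txt" "$workdir/in/chunk_"

for chunk in "$workdir"/in/chunk_*; do
    (
        # Do the expensive work outside the lock...
        processed=$(tr a-z A-Z < "$chunk")
        # ...and hold the lock (on file descriptor 9) only for the append.
        flock 9
        printf '%s\n' "$processed" >> "$workdir/out.txt"
    ) 9> "$workdir/out.lock" &
done
wait    # the lock is released as each subshell exits

# Lines within a chunk stay together, but chunk order is not guaranteed,
# so sort before comparing against a serial run.
locked_sorted=$(sort "$workdir/out.txt")
printf '%s\n' "$locked_sorted"
```

Keeping the processing outside the locked region is what preserves the parallelism; locking around every line would serialize the workers.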

There are various possible solutions, though. There is a whole chapter in the perl documentation on this: perlipc. But keep in mind you can retrieve data with Parallel::ForkManager.

Answer 3

@Ulfalizer's answer gives you a good hint about the solution, but it lacks some details.

You can use GNU parallel (apt-get install parallel on Debian).

So your problem can be solved using the following command:

parallel -a infile.txt -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt

The arguments have the following meanings:

-a: read input from file instead of stdin
-l 1000: send blocks of 1000 lines to the command
-j 10: launch 10 jobs in parallel
-k: keep the output in the same order as the input
--spreadstdin: send each 1000-line block to the stdin of the command (an alias for --pipe)
