Command output mangled on redirection

Question

Given a directory containing a few million files, we want to extract some data from those files.

find /dir/ -type f | awk -F"|" ' ~ /string/{ print "|" }' > the_good_stuff.txt

This will never scale, so we introduce xargs.

find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" ' ~ /string/{ print "|" }'

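This produces valid output no matter how long we run it. Great, so let's write it to a file by appending > the_good_stuff_from_xargs.txt to the command. Except now the file contains mangled lines.

What strikes me is that the data looks fine when I watch, in my terminal, the output of the six subprocesses that xargs spawns on STDOUT. The corruption appears the moment the output is redirected to the filesystem.

I have tried appending each of the following to the command.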

> myfile.txt

>> myfile.txt

| mawk '{print $0}' > myfile.txt

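I also tried various other redirections and other ways of "pooling" the output of xargs before writing it to disk. The data was corrupted in every version.

I am positive the raw files are not malformed. I am also positive that, viewed in the terminal as stdout, the command with xargs produces valid output: I stared at it spitting out text for as long as 10 minutes...

The local disk is an SSD... I am reading from and writing to the same filesystem.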

Why does redirecting the output of find /dir/ -type f -print0 | xargs -0 -n1 -P6 awk -F"|" ' ~ /string/{ print "|" }' cause the data to become malformed?

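Edit

I can't install unbuffer at the moment, but stdbuf -oL -eL switches the command's output to line buffering, so in theory it should do the same thing.

I tried both stdbuf xargs cmd and xargs stdbuf cmd; both produced badly broken lines.

The -P6 is required for this command to complete in any reasonable amount of time.

Edit 2

To clarify... xargs and its -P6 flag are requirements for solving the problem, because the directory we are working with has millions of files that must be scanned.

Obviously we could remove -P6 or otherwise stop running multiple jobs at once, but that doesn't really answer why the output is getting mangled, nor is it a realistic way to get the output back to a "correct" state while still accomplishing the task at scale.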

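Solution

The accepted answer, which mentions using parallel, worked best of all the answers.

The final command I ran looks like this:

time find -L /dir/ -type f -mtime -30 -print0 | parallel -0 -X awk -f manual.awk > the_good_stuff.txt

awk was being difficult, so I moved the -F"|" into the script itself. By default, parallel starts one job per core on the box; you can use -j to lower the number of jobs if need be.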
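The contents of manual.awk are not shown in the post. As a rough sketch of what moving the separator into the script could look like, the stand-in below sets FS in a BEGIN block instead of passing -F"|" on the command line. The field numbers, the sample data, and the temporary paths are all assumptions made for illustration.

```shell
# Hypothetical stand-in for manual.awk; the real script is not shown in the post.
workdir=$(mktemp -d)

cat > "$workdir/manual.awk" <<'EOF'
BEGIN { FS = "|" }            # takes the place of -F"|" on the command line
$1 ~ /string/ { print $2 }    # field numbers are assumed, for illustration only
EOF

# Two sample records: only the first has "string" in its first field.
printf 'a string|keep\nother|drop\n' > "$workdir/sample.txt"

result=$(awk -f "$workdir/manual.awk" "$workdir/sample.txt")
echo "$result"

rm -rf "$workdir"
```

Keeping the separator inside the script means the parallel command line carries no quoted arguments that could be re-parsed on the way to awk.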

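In truly scientific terms, this was a massive speed-up. What used to take an unmeasured number of hours (probably more than 6) was 10% complete after 6 minutes, so it will likely finish within an hour.

One catch: you have to make sure the command running under parallel isn't itself trying to write to a file... that effectively bypasses the output processing parallel performs on the jobs it runs!

Lastly, without -X, parallel behaves similarly to xargs -n1.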

Answer 1

I would just do the following:

cat /${dir}/* | awk ' ~ /string*/{ print  "|"  }' >> `date`.txt

where the file is named after the date and time at which the process was run.
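One caveat with naming the file this way: the default output of date contains spaces, and bash typically rejects an unquoted expansion containing spaces as a redirection target ("ambiguous redirect"). A minimal sketch of a safer variant follows; the format string is an arbitrary choice, not something the answer specifies.

```shell
# Build a timestamped file name with no spaces or colons in it.
# The format below is an assumption; any space-free format would do.
stamp=$(date +%Y-%m-%d_%H%M%S)
outfile="${stamp}.txt"

echo "output would be appended to: $outfile"
```

The quoted "$outfile" can then be used as the target of >> in place of the backtick expression.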

Answer 2

man xargs mentions this problem: "Please note that it is up to the called processes to properly manage parallel access to shared resources. For example, if more than one of them tries to print to stdout, the output will be produced in an indeterminate order (and very likely mixed up)"

Fortunately, there is a way to make this operation an order of magnitude faster and solve the mangling problem at the same time:

find /dir/ -type f -print0 | xargs -0 awk -F"|" ' ~ /string/{ print "|" }'

Why?

-P6 is shuffling your output, so don't use it. xargs -n1 starts one awk process per file, whereas without -n1, xargs starts far fewer awk processes, like this:

files | xargs -n1 awk
=>
awk file1
awk file2
...
awk fileN

vs

files | xargs awk
=>
awk file1 file2 ... fileN # or broken into a few awk commands if many files
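The difference in process count can be checked directly. In the sketch below, echo stands in for awk (an assumption made to keep the example self-contained): each invocation prints exactly one line, so counting output lines counts invocations.

```shell
# echo stands in for awk: one output line per invocation.
per_file=$(printf '%s\n' file1 file2 file3 | xargs -n1 echo | wc -l | tr -d ' ')
batched=$(printf '%s\n' file1 file2 file3 | xargs echo | wc -l | tr -d ' ')

echo "invocations with -n1:    $per_file"
echo "invocations without -n1: $batched"
```

With three input names, -n1 starts three processes while the batched form starts one, which is where the process start-up cost goes when there are millions of files.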

I ran your code on ~20k text files, each ~20k in size, with and without -n1 -P6:

with -n1 -P6  23.138s
without        3.356s

If you want parallel processing without xargs shuffling stdout, use GNU parallel (as Gordon Davisson also suggested), for example:

find /dir/ -type f -print0 | parallel --xargs -0 -q awk -F"|" ' ~ /string/{ print "|" }'

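Note: -q is needed to quote the command string; otherwise the quotes around -F"|" and around the awk code are lost by the time parallel runs the command.

parallel saves some time, but not as much as dropping -n1 did: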

parallel       1.704s
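Where GNU parallel cannot be installed, one possible fallback (not taken from the answers here) is to keep xargs -P but have each worker write to its own temporary file and concatenate the parts afterwards. Interleaving comes from several processes writing to the same file at once, usually in buffer-sized chunks rather than whole lines, so giving every worker a private file sidesteps it. The sketch below runs on throwaway sample data; the awk program, field numbers, batch size, and paths are all assumptions.

```shell
# Sample input: eight small files, each with one matching record.
workdir=$(mktemp -d)
mkdir "$workdir/in" "$workdir/out"
for i in 1 2 3 4 5 6 7 8; do
  printf 'string|file%s\n' "$i" > "$workdir/in/f$i.txt"
done

# Exported so the sh started by xargs can see them.
export OUT="$workdir/out"
export PROG='$1 ~ /string/ { print $2 }'   # assumed awk program

# Each worker handles a batch of files and writes to its own part file,
# so no two processes ever write to the same file concurrently.
find "$workdir/in" -type f -print0 |
  xargs -0 -n2 -P4 sh -c 'awk -F"|" "$PROG" "$@" > "$(mktemp "$OUT/part.XXXXXX")"' _

# Merge the parts once every worker has finished.
merged=$(cat "$workdir"/out/part.* | sort)
line_count=$(printf '%s\n' "$merged" | wc -l | tr -d ' ')
echo "$line_count intact lines"

rm -rf "$workdir"
```

The order of records across part files is still not deterministic, which is why the example sorts before inspecting the result; only the integrity of individual lines is preserved.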

PS: introducing cat (as Matt did in his answer) is a little faster than plain xargs awk:

xargs awk        3.356s
xargs cat | awk  3.036s

Tags: linux, filesystems, io, bash, redirect