Speeding up GNU find on several folders

Question

On a 64-bit Linux CentOS server, I run a GNU find command on several folders, each of which contains a similar subfolder structure. The structure is:

/my/group/folder/project_123/project_123-12345678/*/*file_pattern_at_this_level*
/my/group/folder/project_234/project_234-23456789/*/*file_pattern_at_this_level*

The asterisk folder /*/ indicates that each project folder contains a bunch of subfolders with varying names.

I tried adding the final asterisk and then restricting the find command to a given -mindepth N and -maxdepth N:

find $folder1 $folder2 $folder3 -mindepth 1 -maxdepth 1 -name "*file_pattern*"

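But the tests were run on a server node that has other jobs running, so it is hard to get a fair performance comparison. This is also largely because some level of caching kicks in after the first command, which makes whichever command runs first slower and an equivalent second command faster.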

This is a multi-core node, so what else can I try to make this kind of command faster?

Answer 1

"Actually commands like find and grep are almost always IO-bound: the disk is the bottleneck, not the CPU. In such cases, if you run several instances in parallel, they will compete for I/O bandwidth and cache, and so they will be slower." - https://unix.stackexchange.com/a/111409

Don't worry about "finding" the files; worry about what you need to do with them. For that part, you can use "parallel" or "xargs".
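As a rough illustration of that advice, the sketch below runs a single find and hands its results to xargs -P, so that the per-file work, not the search, runs in parallel. The sample tree, the file names, and the use of wc -c as the per-file job are placeholders and not part of the original setup.

```shell
set -eu

# Hypothetical sample tree; replace with the real project folders.
work=$(mktemp -d)
mkdir -p "$work/proj/sub1" "$work/proj/sub2"
printf 'abc\n'   > "$work/proj/sub1/a_file_pattern"
printf 'abcde\n' > "$work/proj/sub2/b_file_pattern"

# One find emits NUL-separated names; xargs runs up to 4 jobs at once,
# one file per job. wc -c stands in for whatever the real processing is.
results=$(find "$work/proj" -mindepth 2 -maxdepth 2 -name '*file_pattern*' -print0 |
          xargs -0 -n 1 -P 4 wc -c | sort -n)
printf '%s\n' "$results"
```

The -print0 / -0 pair keeps file names with spaces or newlines intact; -P only helps if the per-file job is expensive enough to matter.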

Parallelizing

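If you still want to do it, you can try using "parallel" together with find, passing it a list of directories. This spawns a bunch of find processes in parallel (the -j option sets how many "threads" run at the same time) to work through the "queue". In this case you will need to redirect stdout to a file so you can look at the output later, or not, depending on your use.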
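A minimal sketch of this idea, using xargs -P from GNU findutils instead of GNU parallel (an equivalent parallel invocation is shown in a comment). The directory tree, the file names, and the find_results.txt output file are made up for illustration. Given the I/O-bound caveat quoted above, whether this is actually faster than a single find depends on the storage.

```shell
set -eu

# Hypothetical stand-in for /my/group/folder: two projects with the same layout.
root=$(mktemp -d)
mkdir -p "$root/project_123/project_123-12345678/sub_a" \
         "$root/project_234/project_234-23456789/sub_b"
touch "$root/project_123/project_123-12345678/sub_a/x_file_pattern_1" \
      "$root/project_234/project_234-23456789/sub_b/y_file_pattern_2" \
      "$root/project_234/project_234-23456789/sub_b/unrelated"

out="$root/find_results.txt"

# One find per project directory, at most 4 running at once (-P 4).
# Depth 2 is the level just below the varying subfolders.
printf '%s\0' "$root"/project_*/project_*-* |
  xargs -0 -n 1 -P 4 sh -c \
    'find "$1" -mindepth 2 -maxdepth 2 -name "*file_pattern*"' sh > "$out"

# Roughly equivalent with GNU parallel, if it is installed:
#   parallel -j 4 "find {} -mindepth 2 -maxdepth 2 -name '*file_pattern*'" \
#     ::: "$root"/project_*/project_*-* > "$out"

sort "$out"
```

With several processes writing to the same file, the order of lines is not deterministic, so sort the output afterwards if order matters.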


linux

bash

gnu-coreutils