What is faster/better practice between a for loop for greping a file & greping a file with a file query?

Question

I used to have a script like the following:

for i in $(cat list.txt)
do
  grep $i sales.txt
done

where cat list.txt gives

tomatoes
peppers
onions

and cat sales.txt gives

Price Products
.88 bread
.75 tomatoes
.34 fish
.57 peppers
$0.95 beans
.56 onions

I am a beginner in BASH/SHELL, and after reading posts such as Why is using a shell loop to process text considered bad practice?, I changed the previous script to the following:

grep -f list.txt sales.txt

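Is this last approach really better than using a for loop? At first I thought so, but then I realized it is probably the same, since grep would have to read the query file every time it greps a different line in the target file. Does anyone know whether it is actually better, and why? If it is better in some way, I am probably missing something about how grep handles this task, but I can't figure it out.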

Answer 1

Expanding on my comment...

You can download the source code of grep via git:

 git clone https://git.savannah.gnu.org/git/grep.git

At line 96 of src/grep.c you can see this comment:

/* A list of lineno,filename pairs corresponding to -f FILENAME
   arguments. Since we store the concatenation of all patterns in
   a single array, KEYS, be they from the command line via "-e PAT"
   or read from one or more -f-specified FILENAMES.  Given this
   invocation, grep -f <(seq 5) -f <(seq 2) -f <(seq 3) FILE, there
   will be three entries in LF_PAIR: {1, x} {6, y} {8, z}, where
   x, y and z are just place-holders for shell-generated names.  */

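That is all the clue we need: the patterns to search for, whether given via -e or read from a file via -f, are all dumped into a single array. That array is then the source of the search. Iterating over that array in C will be faster than a shell loop iterating over the file. So that alone wins the speed contest.

Also, as I mentioned in my comment, grep -f list.txt sales.txt is easier to read, easier to maintain, and invokes only one program (grep).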
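
One way to see that -e and -f patterns end up in the same place is to check that both forms select the same lines. This is only a minimal sketch: it uses a throwaway directory created with mktemp, a shortened copy of the sample data from the question, and the variable names tmpdir, from_file and from_args are made up for illustration.

```shell
# Throwaway directory so no real files are touched (assumed setup).
tmpdir=$(mktemp -d)

# Shortened sample data modeled on the question.
printf '%s\n' tomatoes peppers onions > "$tmpdir/list.txt"
printf '%s\n' 'Price Products' '.88 bread' '.75 tomatoes' \
  '.34 fish' '.57 peppers' '.56 onions' > "$tmpdir/sales.txt"

# Patterns read from a file with -f ...
from_file=$(grep -f "$tmpdir/list.txt" "$tmpdir/sales.txt")

# ... and the same patterns given one by one with -e.
from_args=$(grep -e tomatoes -e peppers -e onions "$tmpdir/sales.txt")

if [ "$from_file" = "$from_args" ]; then
  echo "same lines selected"
fi
```

Both commands start a single grep process that reads sales.txt once, whereas the for loop starts one grep per pattern.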

Answer 2

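Your second version is better because:

It requires only a single pass over the file (it does not need multiple passes, as you seem to think)
It has no globbing and spacing bugs (your first attempt behaves poorly on green beans or /*/*/*/*)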
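
The globbing and spacing problem can be made visible by printing the words that the unquoted $(cat list.txt) would hand to grep. This is an illustrative sketch only: the list contents and the variable names tmpdir and words are made up, and a plain * stands in for the /*/*/*/* example so that the expansion stays confined to a throwaway directory.

```shell
# Throwaway directory that contains nothing but list.txt (assumed setup).
tmpdir=$(mktemp -d)
printf '%s\n' 'green beans' '*' > "$tmpdir/list.txt"

# Unquoted $(cat ...) undergoes word splitting and pathname expansion.
# Run inside tmpdir so that * can only expand to list.txt.
words=$(
  cd "$tmpdir" || exit 1
  for i in $(cat list.txt); do
    printf '<%s>\n' "$i"
  done
)
printf '%s\n' "$words"
```

With only list.txt in the directory, the loop sees the three words green, beans and list.txt rather than the two lines that were written to the file. grep -f list.txt reads each line as one pattern, so neither the splitting nor the expansion happens.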

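It is completely fine to read files purely in shell code when 1. it is done correctly and 2. the overhead is negligible, but neither applies to your first example (except for the fact that the files are currently small).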
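
For comparison, here is a sketch of what "done correctly" could look like for the loop. It is not a recommendation over grep -f. The sample lines, their prices, and the names tmpdir, pattern and loop_out are made up for illustration.

```shell
# Throwaway directory with made-up sample data (assumed setup).
tmpdir=$(mktemp -d)
printf '%s\n' 'green beans' 'onions' > "$tmpdir/list.txt"
printf '%s\n' '.95 green beans' '.56 onions' '.40 beans' > "$tmpdir/sales.txt"

# IFS= and -r keep each input line intact, quoting "$pattern" prevents
# word splitting and globbing, and -- stops grep from treating a
# pattern that starts with a dash as an option.
loop_out=$(
  while IFS= read -r pattern; do
    grep -- "$pattern" "$tmpdir/sales.txt"
  done < "$tmpdir/list.txt"
)
printf '%s\n' "$loop_out"
```

This handles green beans as a single pattern, but it still starts one grep per line of list.txt and rereads sales.txt each time, so the overhead point remains.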


bash

shell

grep

text-processing

loops