Grep for a dynamic pattern in a file and print the other lines that have that pattern plus another pattern
Suppose I have a log file like the following:
06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827 INFO 06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855 INFO 06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861 INFO 06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873 INFO 06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902 INFO 06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970 INFO 06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991 INFO 06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085 INFO 06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094 INFO 06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123 INFO 06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132 INFO 06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
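What I want to do is:
- if any line contains "Matched Line", get the unique ID in column 4 (for example, 06z07mjBYxFpzs), and
- search for the other lines that have that unique ID plus the text "Some Data xxyyzz", and
- print the lines matching that pattern (unique ID + "Some Data xxyyzz") on the console as the final output.
So in this case the output should be: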
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
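The file I am talking about is huge (~200 GB, with millions of records) and sits on a shared server, so I cannot run scripts or commands that take a long time.
[EDIT] - Currently I do this with fgrep, by printing the unique IDs from the Matched Line lines into one file and the unique IDs from the Some Data xxyyzz lines into another file; but I am looking for a single-line grep, awk, or sed command (without creating multiple files to fgrep).
[EDIT 2] - This data is not in a file; it is the intermediate output of a series of grep and sort commands.
[EDIT 3] - Updated sample input (not in order, but jumbled):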
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716 INFO 06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759 INFO 06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755 INFO 06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827 INFO 06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855 INFO 06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861 INFO 06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873 INFO 06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902 INFO 06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970 INFO 06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991 INFO 06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085 INFO 06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094 INFO 06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094 INFO 06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095 INFO 06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123 INFO 06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132 INFO 06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
grep "Matched Line" data.txt | awk '{print $4}' | xargs -l1 -i grep {} data.txt | grep -v "Matched Line"
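- Search for all "Matched Line" lines
- Print the 4th field of each of those lines to standard output
- For each line of that output, run grep to search for the printed ID
- Then filter again, keeping only the lines without "Matched Line"
Output: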
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
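The repeated lines in that output come from IDs that appear on more than one "Matched Line" record, so grep runs once per occurrence. A minimal sketch of one way to avoid that, assuming it is acceptable to emit each ID only once; the function name unique_matched_ids is hypothetical and not part of the original answer:

```shell
# Hypothetical helper: print each "Matched Line" ID (field 4) once,
# in the order it is first seen. Reads the named files, or stdin if none.
unique_matched_ids() {
    awk '/Matched Line/ && !seen[$4]++ { print $4 }' "$@"
}
```

It could stand in for the first two stages of the pipeline, for example `unique_matched_ids data.txt | xargs -I{} grep {} data.txt | grep -v "Matched Line"`. This still runs one grep over the whole file per ID, so it only removes the duplicate output and the duplicate reads; it does not make the approach cheap on a 200 GB file.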
Alternatively, using bash's process substitution, we can reduce the number of times data.txt is read to just two:
grep -f <(grep "Matched Line" data.txt | awk '{print $4}') data.txt | grep -v "Matched Line"
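Since the IDs are literal strings rather than regular expressions, the same two-read idea can be written with fixed-string matching, which is what fgrep does. The sketch below is a variation under stated assumptions, not the answer's own command: it assumes GNU grep, where `-f -` reads the pattern list from standard input, and the function name data_lines_for_matched_ids is hypothetical.

```shell
# Hypothetical helper: two reads of the file, fixed-string matching.
# Assumes GNU grep, where "-f -" takes the pattern list from stdin.
data_lines_for_matched_ids() {
    file=$1
    # Read 1: collect each "Matched Line" ID once.
    awk '/Matched Line/ && !seen[$4]++ { print $4 }' "$file" |
        # Read 2: keep lines containing any collected ID as a literal string.
        grep -F -f - "$file" |
        # Keep only the data lines.
        grep -F 'Some Data xxyyzz'
}
```

Note that grep matches an ID anywhere on the line, not only in field 4, and that this needs a real file to read twice, so it does not fit the piped input described in Edit 2.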
Ordered data
The following traverses the file only once, so it should be fast:
$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
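In the sample input (the original question), every Some Data line that we want comes right after its Matched Line. That is what makes this quick and simple solution possible.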
How to use it in a pipeline
awk works well in pipelines. If the input comes not from a file but from a pipeline, as in Edit 2, use something like:
cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3
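How it works
/Matched Line/{id=$4;next;}
Every time we find a line that contains the text Matched Line, we save its ID (field 4) in the variable id. Since we don't want to print the Matched Line lines themselves, we tell awk to skip the remaining commands and jump to the next line.
id==$4
Whenever the ID of the current line (field 4) matches the id that we saved, we print the line.
(In awk terminology, id==$4 is a condition: it evaluates to true or false. When a condition is true, an action is performed. Here we specified no action, so awk performs its default action, which is to print the line.)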
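The same logic can be spelled out as a commented, multi-line awk program. This is a sketch rather than the answer's exact command: it adds an explicit check for the text "Some Data xxyyzz" (an assumption based on the question's wording, since the one-liner prints any non-matched line with the saved ID), and the wrapper name filter_ordered is hypothetical.

```shell
# Hypothetical wrapper: reads the named files, or stdin when used in a pipeline.
filter_ordered() {
    awk '
        # A "Matched Line" record: remember its ID (field 4), print nothing.
        /Matched Line/ { id = $4; next }
        # Any other record: print it only if its ID equals the saved one
        # and it carries the data text from the question.
        $4 == id && /Some Data xxyyzz/
    ' "$@"
}
```

It can be called as `filter_ordered file.log` or placed in the middle of a pipeline, as in `cmd1 | filter_ordered | cmd3`.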
Partially ordered data
In Edit 3, a data line can appear at any position after its matched line. In that case:
$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz
Or, in a pipeline:
cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'
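The array version still relies on each data line arriving after its matched line. If a data line could also come before its matched line, one option is to read the input twice. This is a sketch under the assumption that the data is in a regular file (so it does not fit the piped case in Edit 2); the function name filter_two_pass is hypothetical.

```shell
# Hypothetical helper: two passes over the same file, so the order of
# matched lines and data lines does not matter.
filter_two_pass() {
    awk '
        # Pass 1 (NR == FNR only while reading the first file argument):
        # record every ID that appears on a "Matched Line" record.
        NR == FNR { if (/Matched Line/) ids[$4]; next }
        # Pass 2: print data lines whose ID was recorded in pass 1.
        ($4 in ids) && /Some Data xxyyzz/
    ' "$1" "$1"
}
```

The test `$4 in ids` checks membership without creating an array entry, whereas a plain reference such as `id[$4]` creates an empty entry for every ID it looks up, which can add up over millions of distinct IDs. Either way, memory use grows with the number of distinct matched IDs.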