GREP for a dynamic pattern in a file and print the other lines having former pattern and another pattern

Suppose I have a log file like the one below:

06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

What I want to do is: whenever a line contains Matched Line, take its unique ID (the 4th field) and print the other lines that carry the same ID together with Some Data xxyyzz.

So in this case the output should be:

06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz

The file I am talking about here is huge (~200 GB; millions of records) and sits on a shared server, so I cannot run a script or command that takes a lot of time.

[EDIT] - Currently doing this with fgrep, by printing the unique IDs from the Matched Line lines into one file and the unique IDs from the Some Data xxyyzz lines into another; but I am looking for a one-liner grep/awk/sed command (without having to create multiple files to fgrep).
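
A rough sketch of what that multi-file workaround might look like (the file names and the final fgrep step are assumptions, not the exact commands used):

grep "Matched Line" data.txt | awk '{print $4}' | sort -u > matched_ids.txt
grep "Some Data xxyyzz" data.txt > some_data_lines.txt
fgrep -f matched_ids.txt some_data_lines.txt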

[EDIT 2] - This output is not in a file; it is the intermediate output of a series of greps and sorts.

[EDIT 3] - Updated sample input (not in order, but shuffled):

06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:20.716  INFO   06z07mjBYxFpzs Matched Line
06/30/2015 00:17:20.735  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.759  INFO   06z07mGDQ9thtY Some Data xxyyzz
06/30/2015 00:17:20.755  INFO   06z07mdgC66vHc Matched Line
06/30/2015 00:17:20.784  INFO   06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.827  INFO   06z07n2q9S4g07 Some Data xxyyzz
06/30/2015 00:17:20.855  INFO   06z07mxt44CF03 Some Data xxyyzz
06/30/2015 00:17:20.861  INFO   06z07n5mxfYkHg Some Data xxyyzz
06/30/2015 00:17:20.873  INFO   06z07nm473brzB Some Data xxyyzz
06/30/2015 00:17:20.723  INFO   06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.902  INFO   06z07mM059k0tZ Some Data xxyyzz
06/30/2015 00:17:20.970  INFO   06z07nx2lv9wzC Matched Line
06/30/2015 00:17:20.974  INFO   06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.991  INFO   06z07ngwMW16zz Matched Line
06/30/2015 00:17:21.085  INFO   06z07n42C6Qczx Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.094  INFO   06z07mxR42tZzw Some Data xxyyzz
06/30/2015 00:17:20.994  INFO   06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.094  INFO   06z07mWbfVCGD3 Some Data xxyyzz
06/30/2015 00:17:21.095  INFO   06z07nMgPJpPv1 Matched Line
06/30/2015 00:17:21.100  INFO   06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.123  INFO   06z07p0yBwLv0b Some Data xxyyzz
06/30/2015 00:17:21.132  INFO   06z07nSLzf66Hk Matched Line
06/30/2015 00:17:21.137  INFO   06z07nSLzf66Hk Some Data xxyyzz
grep "Matched Line" data.txt  | awk '{print }' | xargs -l1 -i grep {} data.txt | grep -v "Matched Line"
  1. Search for all "Matched Line" lines
  2. Print the 4th field of each of those lines to stdout
  3. For each printed id, run grep again: search the file for that id
  4. Then grep once more, this time excluding "Matched Line"

Output:

06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Alternatively, using bash's process substitution, we can cut the number of times data.txt is read down to just two:

grep -f <(grep "Matched Line" data.txt | awk '{print $4}') data.txt | grep -v "Matched Line"

Ordered data

The following makes only a single pass over the file, so it should be fast:

$ awk '/Matched Line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz

In the sample input (from the original question), every Some Data line immediately follows its Matched Line. That is what makes this fast, simple solution possible.

How to use it in a pipeline

awk works well in a pipeline. If the input comes not from a file but from a pipe, as in Edit 2, then use something like:

cmd1 <file.log | cmd2 | awk '/Matched Line/{id=$4;next;} id==$4' | cmd3
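
For instance, if the surrounding stages were a date filter and a final sort (both made up here purely for illustration), the awk program would simply sit between them:

grep "06/30/2015" file.log | awk '/Matched Line/{id=$4;next;} id==$4' | sort -u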

How it works

  • /Matched Line/{id=$4;next;}

    Whenever we find a line containing the text Matched Line, we save its ID in the variable id. Since we do not want to print the Matched Line itself, we tell awk to skip the remaining commands and jump to the next line.

  • id==$4

    Whenever the ID of the current line (field 4) matches the id we saved, we print that line.

    (In awk terms, id==$4 is a condition: it evaluates to either true or false. When the condition is true, the action is executed. Here we did not specify any action, so awk performs its default action of printing the line; see the small illustration right after this list.)
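
As a tiny illustration of this condition/default-action behaviour (the two-line input is made up):

$ printf 'a 1\nb 2\n' | awk '$2==2'
b 2

The condition $2==2 is true only for the second line, and since no action is given, awk falls back to its default action of printing that line.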

Partially ordered data

编辑 3 中,数据行可以出现在匹配行之后的某个随机位置。在那种情况下:

$ awk '/Matched Line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 INFO 06z07mdgC66vHc Some Data xxyyzz
06/30/2015 00:17:20.723 INFO 06z07mjBYxFpzs Some Data xxyyzz
06/30/2015 00:17:20.974 INFO 06z07nx2lv9wzC Some Data xxyyzz
06/30/2015 00:17:20.994 INFO 06z07ngwMW16zz Some Data xxyyzz
06/30/2015 00:17:21.100 INFO 06z07nMgPJpPv1 Some Data xxyyzz
06/30/2015 00:17:21.137 INFO 06z07nSLzf66Hk Some Data xxyyzz 

Or, in a pipeline:

cmd1 file.log | awk '/Matched Line/{id[$4]=1;next;} id[$4]'