Parsing program output line by line on-the-fly

Question

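I have a program that does some heavy processing (an ML algorithm) and writes a lot of data (read: GBs of plain text) to standard output. In some specific cases I only need a small part of that output, but right now I save one (huge) text file and then parse its lines to get my data.

While this works, my approach is very inefficient. Is there a way to avoid generating such a large file (most of the data gets discarded anyway) and to parse the output line by line on the fly?

Execution: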

./myProgram model test > myOutput

Contents of myOutput (millions of lines):

0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]
0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]
....

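I think the best option is to use a unix pipe to chain the results, but I don't know how to send the data line by line to, let's say, a python or java application that parses it.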

./myProgram model test | <now what>

Answer 1

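The pipe does exactly that. It sends the data (possibly buffered) to the program on the right-hand side of the pipe.

That program can operate on the data however it likes.

Programs such as grep, sed and awk operate on that data in a line-oriented fashion.

Other programs can do other things as they want/need.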
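
As a rough illustration of a program consuming a pipe line by line, here is a minimal Python sketch that opens the pipe itself instead of relying on the shell. The function name stream_lines and the stand-in producer command are assumptions made for the example; in practice the command would be ["./myProgram", "model", "test"]. Lines arrive only as fast as the producer flushes them, so its own output buffering may still delay them.

```python
import subprocess
import sys


def stream_lines(command):
    """Run `command` and yield its stdout one line at a time as it arrives."""
    # text=True decodes bytes to str; bufsize=1 asks for line buffering on our side.
    with subprocess.Popen(
        command, stdout=subprocess.PIPE, text=True, bufsize=1
    ) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Stand-in producer (an assumption): a Python one-liner instead of ./myProgram.
    producer = [sys.executable, "-c", "print('first'); print('second')"]
    for line in stream_lines(producer):
        print("got:", line)
```

Nothing is written to disk here: each line is handed to the loop body and can be discarded right away.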

Answer 2

To read and write the data in a script, use a filter that simply reads from standard input and writes to standard output.

./myProgram model test | ./filter.py > myOutput

filter.py:

import sys

for line in sys.stdin:
    if some_condition:
        sys.stdout.write(line)
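
As a concrete instance of the filter above, assuming the goal is the bracketed text from the question's sample lines, the condition could be a regular expression match. The name extract_bracketed is hypothetical, and the demo feeds a small list where filter.py would pass sys.stdin.

```python
import re

# Greedy match: from the first "[" to the last "]" on the line.
BRACKETS = re.compile(r"\[(.*)\]")


def extract_bracketed(lines):
    """Yield the stripped text inside the brackets of every line that has them."""
    for line in lines:
        match = BRACKETS.search(line)
        if match:  # lines without brackets are dropped
            yield match.group(1).strip()


if __name__ == "__main__":
    # Demo input copied from the question; in filter.py, pass sys.stdin instead.
    sample = [
        "0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]\n",
        "a line without the interesting part\n",
    ]
    for payload in extract_bracketed(sample):
        print(payload)
```

Because sys.stdin is iterated lazily, a filter built this way would hold only one line in memory at a time.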

If the condition is just the presence of some pattern in the data, you don't need a script; you can simply use grep to filter the lines:

 ./myProgram model test | grep 'interesting_pattern' > myOutput

Answer 3

./myProgram model test | now what

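If I understand correctly that you want [ I care data inside brackets ] (just the data between the brackets), then one way to do it is to pipe the output to sed and use a backreference to replace each line of text with what is inside the brackets. So now what is sed -e 's/^.*[[]\(.*\)[]].*$/\1/'. Or, putting it together:

./myProgram model test | sed -e 's/^.*[[]\(.*\)[]].*$/\1/' > myOutput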

If your program produces output like the sample you gave, for example:

$ echo "0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]" |
sed -e 's/^.*[[]\(.*\)[]].*$/\1/'
 I care data inside brackets

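A short explanation of the regular expression, which is easier to read in pieces:

 's/^.*[[]\(.*\)[]].*$/\1/'

It is a simple substitution expression of the form s/this/that/. Looking at the first part (the this), you have: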

^.*   # from the beginning of the line, match all characters
[[]   # up to an open bracket [ (.* is greedy, so the last [ that still has a ] after it)
\(    # begin saving the pattern that follows
.*    # all characters in this case
\)    # stop collecting the pattern
[]]   # before you encounter the close bracket ]
.*$   # and then all remaining characters in the line.

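Next, the second part (the that) of the s/this/that/ expression is a backreference, which says:

\1    # substitute the (1st) pattern you collected. All between \(...\) for the line.

Put together, it simply says: replace the line with what is between the brackets. (Of course, if I misunderstood what you need, that was a long explanation down the tubes.)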
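
For comparison, the same substitution can be sketched in Python with re.sub, where \1 in the replacement plays the role of the sed backreference. The function name bracket_contents is an assumption for the example. Like the sed command, it leaves lines without brackets unchanged and keeps the spaces just inside the brackets.

```python
import re

# Same shape as the sed pattern: everything up to "[", a captured group, "]", the rest.
PATTERN = re.compile(r"^.*\[(.*)\].*$")


def bracket_contents(line):
    """Replace the whole line with the text captured between the brackets."""
    return PATTERN.sub(r"\1", line)


if __name__ == "__main__":
    sample = "0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]"
    # repr() makes the surrounding spaces visible.
    print(repr(bracket_contents(sample)))
```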

Answer 4

If the output is indeed line-oriented and you want to extract or process some of it, pipe the output to some awk command, i.e.

 ./myProgram model test | awk ...

Of course, replace ... with the appropriate arguments to awk. Read more about GNU awk (a.k.a. gawk), which is designed for exactly this kind of task:

If you are like many computer users, you would frequently like to make changes in various text files wherever certain patterns appear, or extract data from parts of certain lines while discarding the rest. To write a program to do this in a language such as C or Pascal is a time-consuming inconvenience that may take many lines of code. The job is easy with awk, especially the GNU implementation: gawk.

Alternatively, you could modify your original ./myProgram so that it populates some database, e.g. with sqlite (an easy to use library) or with something more serious like PostgreSQL or MongoDB.
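
The suggestion above is to do this inside ./myProgram itself. As a stand-in, here is a minimal Python sketch of the same idea on the consuming side, using the standard sqlite3 module. The table name results, the column payload and the function store_bracketed are all hypothetical, and the demo uses an in-memory database and a small sample list where a real run would read lines from the pipe.

```python
import re
import sqlite3

BRACKETS = re.compile(r"\[(.*)\]")


def store_bracketed(lines, connection):
    """Insert the bracketed part of each matching line into a `results` table."""
    connection.execute("CREATE TABLE IF NOT EXISTS results (payload TEXT)")
    # Generator of 1-tuples, so lines are consumed lazily while inserting.
    rows = ((m.group(1).strip(),) for m in map(BRACKETS.search, lines) if m)
    connection.executemany("INSERT INTO results (payload) VALUES (?)", rows)
    connection.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
    sample = ["0, blah -> [ first ]", "noise without brackets", "1, blah -> [ second ]"]
    store_bracketed(sample, conn)
    print(conn.execute("SELECT payload FROM results ORDER BY rowid").fetchall())
```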
