Regex vs readline for text processing

Question

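I have some text to process (router output) and want to build a useful data structure from it: a dictionary with the interface name as the key and the packet counts as the value. I have two ways of doing the same task. I would like to know which one I should use for efficiency, and which one looks more likely to fail on larger data samples.

ReadLine1 splits the output into a list of lines, walks through it, and writes to a dictionary with the interface name as the key and the next three items as the value.

Readline2 uses the re module, matches groups, and writes the dictionary keys and values from those groups.

The input to these functions, self.output, looks like this: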

message = 
"""
Interface 1/1\n\t
    input : 1234\n\t
    output : 3456\n\t
    dropped : 12\n
\n
Interface 1/2\n\t
    input : 7123\n\t
    output : 2345\n\t
    dropped : 31\n\t
"""

def ReadLine1(self):
    lines = self.output.splitlines()
    for index, line in enumerate(lines):
        if "Interface" in line:
            valuelist = []
            for i in [1,2,3]:
                valuelist.append((lines[index+i].split(':'))[1].strip())
            self.IFlist[line.split()[1]] = valuelist
    return self.IFlist

def Readline2(self):
    #print repr(self.output)
    n = re.compile(r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)",re.MULTILINE|re.DOTALL)
    blocks = self.output.split('\n\n')
    for block in blocks:
        m_object = re.match(n, block)
        self.IFlist[m_object.group(1)] = [m_object.group(i) for i in (2,3,4)]

Answer 1

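Both of your methods rely on specific aspects of the format to do the parsing, so if that format is changed or broken, either one of them can break too.

For example, if you add a space (which you can't see) to the blank line between the two entries, then blocks = self.output.split('\n\n') will not find two consecutive newlines, and the regex version will miss the second entry: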

{'1/1': ['1234', '3456', '12']}
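
A minimal sketch of that failure. The `sample` string here is a hypothetical cleaned-up version of the router output (not the exact `message` from the question), and the line between the two entries holds a single space:

```python
import re

# Hypothetical sample: the "blank" line between the entries contains one space.
sample = (
    "Interface 1/1\n"
    "    input : 1234\n"
    "    output : 3456\n"
    "    dropped : 12\n"
    " \n"
    "Interface 1/2\n"
    "    input : 7123\n"
    "    output : 2345\n"
    "    dropped : 31\n"
)

pattern = re.compile(
    r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)"
)

blocks = sample.split("\n\n")  # there is no "\n\n" in the text, so one block
result = {}
for block in blocks:
    m = pattern.match(block)   # match() only anchors at the start of the block
    if m:
        result[m.group(1)] = [m.group(i) for i in (2, 3, 4)]

print(len(blocks))
print(result)
```

Because the whole text stays in one block and `match()` only looks at its start, the second interface is never examined.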

Or if you add an extra newline between two of the fields, here between output and dropped:

Interface 1/2
    input : 7123
    output : 2345

    dropped : 31

The \s* in the regex handles the extra whitespace, but the non-regex parser assumes that lines[index+i].split(':') has an index [1], so it raises an IndexError on this data.
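
To see where the IndexError comes from, here is a small sketch using a hypothetical list of lines that mirrors the entry above:

```python
# Hypothetical lines for one entry with an empty line before "dropped".
lines = [
    "Interface 1/2",
    "    input : 7123",
    "    output : 2345",
    "",
    "    dropped : 31",
]

# The third line after "Interface" is now the empty one.
parts = lines[3].split(":")
print(parts)  # a one-element list, so parts[1] does not exist

try:
    value = parts[1].strip()
except IndexError as exc:
    value = None
    print("IndexError:", exc)
```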

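Or if you add some trailing spaces at the end of a line (any line except the last one of a block, since the pattern stops at the final number), the regex no longer sees the newline right after the content, and re.match(n, block) returns None, so the next line raises AttributeError: 'NoneType' object has no attribute 'group'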
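
A short sketch of the trailing-space case, using the same pattern as Readline2. The `parse_block` helper is hypothetical (it is not part of the question's code) and only shows one way to avoid the AttributeError by checking for None:

```python
import re

pattern = re.compile(
    r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)"
)

clean = "Interface 1/1\n    input : 1234\n    output : 3456\n    dropped : 12"
# Same block, with one trailing space after the input value.
trailing = "Interface 1/1\n    input : 1234 \n    output : 3456\n    dropped : 12"

def parse_block(block):
    """Return (name, values), or None when the block does not match."""
    m = pattern.match(block)
    if m is None:
        return None
    return m.group(1), [m.group(i) for i in (2, 3, 4)]

print(parse_block(clean))
print(parse_block(trailing))
```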

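Or if you change Interface to interface in one of the entries (no longer a capital I), the regex raises the same error as above, while the non-regex version silently ignores that entry.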
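
Both checks are case-sensitive, which a few lines can confirm. The shortened pattern and the use of `re.IGNORECASE` are illustrative assumptions, shown as one possible mitigation rather than something the original code does:

```python
import re

line = "interface 1/2"  # lowercase "i"

# Non-regex check: the substring test is case-sensitive, so the entry is skipped.
found_plain = "Interface" in line

# Regex check: the literal "Interface" in the pattern is case-sensitive too.
strict = re.match(r"Interface (./.)", line)

# With re.IGNORECASE the same pattern accepts either spelling.
relaxed = re.match(r"Interface (./.)", line, re.IGNORECASE)

print(found_plain, strict, relaxed.group(1))
```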

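When I tested this, I found that the regex version was more easily broken by small edits to the sample message, but I also found that my version, which uses a generator expression and str.partition, is far more robust than either of them: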

def readline3(self):
    # skip empty and whitespace-only lines up front
    gen_lines = (line for line in self.output.splitlines()
                        if line and not line.isspace())
    try:
        while True:  # ends when next() raises StopIteration
            start, _, key = next(gen_lines).partition(" ")
            if start == "Interface":
                # the next three non-blank lines hold the counters
                self.IFlist[key] = [next(gen_lines).rpartition(" : ")[2]
                                    for _ in "123"]
    except StopIteration:  # reached end of output
        return self.IFlist

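This succeeded in every case mentioned above and in a few others, because the only method it relies on is str.partition, which always returns a 3-item tuple. Nothing in it can raise an unexpected error unless self.output is not a string.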
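
The 3-item guarantee is easy to check. Note the last case in this sketch: when the separator is missing, rpartition puts the whole string in the last slot, so a malformed line ends up stored as a value instead of raising an error.

```python
a = "Interface 1/1".partition(" ")
b = "".partition(" ")
c = "    input : 1234".rpartition(" : ")
d = "garbage".rpartition(" : ")

print(a)  # ('Interface', ' ', '1/1')
print(b)  # ('', '', '')
print(c)  # ('    input', ' : ', '1234')
print(d)  # ('', '', 'garbage')
```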

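Also, running a benchmark with timeit, your readline1 is consistently faster than readline2, and my readline3 usually comes in close to readline1 (in the run below it is slightly slower):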

#using the default 1000000 loops using 'message'
<function readline1 at 0x100756f28>
11.225649802014232
<function readline2 at 0x1057e3950>
14.838601427007234
<function readline3 at 0x1057e39d8>
11.693351223017089
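
A rough sketch of how such a timing could be reproduced. `parse_partition` is a hypothetical standalone variant of readline3 (a plain function instead of a method), `sample` is an assumed cleaned-up input, and the loop count is reduced from the default so it finishes quickly. Absolute numbers will differ per machine.

```python
import timeit

# Assumed sample input, not the exact message from the question.
sample = (
    "Interface 1/1\n    input : 1234\n    output : 3456\n    dropped : 12\n"
    "\n"
    "Interface 1/2\n    input : 7123\n    output : 2345\n    dropped : 31\n"
)

def parse_partition(text):
    """Standalone variant of the partition-based parser, for timing only."""
    result = {}
    gen_lines = (line for line in text.splitlines()
                 if line and not line.isspace())
    for line in gen_lines:
        start, _, key = line.partition(" ")
        if start == "Interface":
            # pull the next three non-blank lines from the same generator
            result[key] = [next(gen_lines, "").rpartition(" : ")[2]
                           for _ in range(3)]
    return result

# timeit accepts a callable; number sets how many times it is called.
elapsed = timeit.timeit(lambda: parse_partition(sample), number=10000)
print(parse_partition(sample))
print(elapsed)
```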

python

regex

dictionary

readline