Lower execution time for apache log parser in Python

Question

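I have a school assignment where I was tasked with writing an apache log parser in Python. The parser uses a regex to extract all the IP addresses and all the HTTP methods and stores them in a nested dictionary. The code is shown below: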

import re

def aggregatelog(filename):
    keyvaluepairscounter = {"IP":{}, "HTTP":{}}
    with open(filename, "r") as file:
        for line in file:
            result = re.search(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)', line).groups() #Combines the regexes: IP (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) and HTTP Method ("(\b[A-Z]+\b))
            if result[0] in set(keyvaluepairscounter["IP"].keys()): #Using set will lower look up time complexity from O(n) to O(1)
                keyvaluepairscounter["IP"][result[0]] += 1
            else:
                keyvaluepairscounter["IP"][result[0]] = 1

            if result[1] in set(keyvaluepairscounter["HTTP"].keys()):
                keyvaluepairscounter["HTTP"][result[1]] += 1
            else:
                keyvaluepairscounter["HTTP"][result[1]] = 1

    return keyvaluepairscounter

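This code works (it gives me the expected data for a given log file). However, when extracting data from a large log file (about 500 MB in my case), the program is very slow: the script takes about 30 minutes to finish. According to my teacher, a good script should be able to process a large file in under 3 minutes (wth?). My question is: is there anything I can do to speed up my script? I have already tried a few things, such as replacing lists with sets, which have better lookup time.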

Answer 1

At the very least, pre-compile the regex before the loop, i.e.

pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*.*"(\b[A-Z]+\b)')

Then in your loop:

for line in file:
    result = pattern.search(line).groups()

You should also consider optimizing your pattern, especially the .*, since it is an expensive operation.
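As a rough sketch of the two suggestions above, the pattern can be compiled once and the greedy .* swapped for something narrower. The version below assumes the IP address starts the line and that the HTTP method is the first token inside the first double-quoted field, as in the common Apache access-log layout; if your log differs, the pattern needs adjusting. The names LOG_PATTERN and aggregate_log_lines and the sample lines are made up for illustration. It also counts with collections.Counter (a dict subclass) instead of testing membership against set(d.keys()), because building that set on every line is itself an O(n) copy.

```python
import re
from collections import Counter

# Compiled once, outside the loop. [^"]* stops at the first double quote,
# so the engine does not scan to the end of the line and backtrack as with .*
# (Assumption: IP at line start, method right after the first quote.)
LOG_PATTERN = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3})[^"]*"([A-Z]+)\b')


def aggregate_log_lines(lines):
    """Count IPs and HTTP methods over an iterable of log lines."""
    counters = {"IP": Counter(), "HTTP": Counter()}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        ip, method = match.groups()
        counters["IP"][ip] += 1
        counters["HTTP"][method] += 1
    return counters


# Hypothetical sample data, only to show the call.
sample = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "POST /form HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:55:38 -0700] "GET /a.png HTTP/1.0" 404 0',
]
result = aggregate_log_lines(sample)
```

Since an open file object is an iterable of lines, it can be passed to aggregate_log_lines directly inside a with open(filename) block.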

Answer 2

I found my answer. Iterate over the result of "re.findall()" directly instead of first storing the returned regex data in an array, like this:

for data in re.findall(pattern, text):
    pass  # do things

instead of

array = re.findall(pattern, text)
for data in array:
    pass  # do things

I also read the whole file in one go:

with open("file", "r") as file:
    text = file.read()

This implementation processed the file in under 1 minute!
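For reference, a minimal sketch of this read-everything-then-findall approach might look like the following. It assumes the whole file fits in memory and that each entry starts with the IP address, with the HTTP method right after the first double quote. It uses a variant of the question's pattern in which .* is replaced by [^"\n]* so that a match cannot spill over into the next line. The names PATTERN and aggregate_text and the sample text are hypothetical, and actual timings will depend on the machine and the log, so the sketch does not promise the one-minute figure.

```python
import re
from collections import Counter

# re.MULTILINE lets ^ anchor at the start of every line of the big string.
# (Assumption: IP at line start, method right after the first quote.)
PATTERN = re.compile(
    r'^(\d{1,3}(?:\.\d{1,3}){3})[^"\n]*"([A-Z]+)\b', re.MULTILINE
)


def aggregate_text(text):
    """Count IPs and HTTP methods in the full contents of a log file."""
    counters = {"IP": Counter(), "HTTP": Counter()}
    # With two groups in the pattern, findall yields (ip, method) tuples.
    for ip, method in PATTERN.findall(text):
        counters["IP"][ip] += 1
        counters["HTTP"][method] += 1
    return counters


# Hypothetical sample data standing in for file.read().
text = (
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326\n'
    '10.0.0.2 - - [10/Oct/2000:13:55:38 -0700] "PUT /x HTTP/1.0" 201 0\n'
)
totals = aggregate_text(text)
```

For a log too large to hold in memory, re.finditer over smaller chunks or a plain line-by-line loop with the compiled pattern would be the safer choice.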

Tags: python, performance