Optimizing Python code that reads a file

Question

I have the following code. Code 1:

logfile = open(logfile, 'r')
logdata = logfile.read()
logfile.close()
CurBeginA = BeginSearchDVar
CurEndinA = EndinSearchDVar
matchesBegin = re.search(str(BeginTimeFirstEpoch), logdata)
matchesEnd = re.search(str(EndinTimeFirstEpoch), logdata)
BeginSearchDVar = BeginTimeFirstEpoch
EndinSearchDVar = EndinTimeFirstEpoch

In another part of the script I also have this code. Code 2:

TheTimeStamps = [ x.split(' ')[0][1:-1] for x in open(logfile).readlines() ]

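Clearly, I am loading the log file twice here, and I would like to avoid that. Is there a way to do what Code 2 does inside Code 1, so that the log file is loaded only once?

In Code 1, I search the log to make sure that two very specific patterns are found on different lines of it.

In Code 2, I just extract the first column of every line in the log file.

How can this be optimized further? I am running it on a log file that is currently 480MB, and the script finishes in about 12 seconds. Since this log can grow to 1GB or even 2GB, I want it to be as efficient as possible.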

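Update:

So the code from @abernert works. I then added some extra logic to it, and now it no longer works. Below is my modified code. What I am basically doing here is: if the patterns for matchesBegin and matchesEnd are both found in the log, then search the log from matchesBegin to matchesEnd and print only the lines that contain both stringA and stringB: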

        matchesBegin, matchesEnd = None, None
        beginStr, endStr = str(BeginTimeFirstEpoch).encode(), str(EndinTimeFirstEpoch).encode()
        AllTimeStamps = []
        mylist = []
        with open(logfile, 'rb') as input_data:
            def SearchFirst():
                matchesBegin, matchesEnd = None, None
                for line in input_data:
                    if not matchesBegin:
                        matchesBegin = beginStr in line
                    if not matchesEnd:
                        matchesEnd = endStr in line
                return(matchesBegin, matchesEnd)
            matchesBegin, matchesEndin = SearchFirst()
            #print type(matchesBegin)
            #print type(matchesEndin)
            #if str(matchesBegin) == "True" and str(matchesEnd) == "True":
            if matchesBegin is True and matchesEndin is True:
                rangelines = 0
                for line in input_data:
                    print line
                    if beginStr in line[0:25]:  # Or whatever test is needed
                        rangelines += 1
                        #print line.strip()
                        if re.search(stringA, line) and re.search(stringB, line):
                            mylist.append((line.strip()))
                        break
                for line in input_data:  # This keeps reading the file
                    print line
                    if endStr in line[0:25]:
                        rangelines += 1
                        if re.search(stringA, line) and re.search(stringB, line):
                            mylist.append((line.strip()))
                        break
                    if re.search(stringA, line) and re.search(stringB, line):
                        rangelines += 1
                        mylist.append((line.strip()))
                    else:
                        rangelines += 1
                #return(mylist,rangelines)
                    print(mylist,rangelines)
                    AllTimeStamps.append(line.split(' ')[0][1:-1])

What am I doing wrong in the code above?

Answer 1

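First, there is almost never a reason to call readlines(). A file is already an iterable of lines, so you can just loop over the file; reading all of those lines into memory and building a giant list just wastes time and memory.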
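
As a rough illustration of that point, here is a minimal sketch. It uses io.StringIO as a stand-in for an open log file, and the sample line format ([timestamp] message) is only an assumption based on the slicing in Code 2. It also shows that a file object is a one-shot iterator: a second loop over the same object yields nothing until you seek back to the start, which is worth keeping in mind for code like the updated snippet in the question, where the same file object is looped over several times.

```python
import io

def first_columns(lines):
    """Return the first space-separated field of each line, minus its first and last character."""
    return [line.split(' ')[0][1:-1] for line in lines]

# io.StringIO stands in for an open log file; the line format is an assumption.
sample = io.StringIO("[1000] start\n[1001] work\n[1002] end\n")

stamps = first_columns(sample)    # iterates the file object directly, no readlines()
leftover = first_columns(sample)  # same object again: it is already at end of file
sample.seek(0)                    # rewind before reading the data a second time
again = first_columns(sample)

print(stamps, leftover, again)
```

The same loop works unchanged on the object returned by open(), since it yields lines in the same way.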

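On the other hand, calling read() is sometimes useful. It does have to read the whole file into memory as one giant string, but a regex search over one giant string can be so much faster than searching line by line that the wasted time and space are more than made up for.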
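
To make the read() variant concrete, here is a small sketch. The helper name find_in_whole_text and the sample data are hypothetical, and re.escape is used on the assumption that the thing being searched for is a literal string rather than a real regex pattern.

```python
import io
import re

SAMPLE_TEXT = "[1000] start\n[1001] work\n[1002] end\n"  # assumed line format

def find_in_whole_text(f, needle):
    """Read everything at once and run a single regex search over the whole string."""
    data = f.read()  # the entire file as one string
    match = re.search(re.escape(needle), data)
    return match.start() if match else -1

offset = find_in_whole_text(io.StringIO(SAMPLE_TEXT), "1001")
missing = find_in_whole_text(io.StringIO(SAMPLE_TEXT), "9999")
print(offset, missing)
```

The cost is that the whole file sits in memory at once, which matters more as the log grows toward 1GB or 2GB.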

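But if you want to reduce this to a single pass over the file, then since you already have to iterate line by line, you have no choice but to run the regex search line by line as well. That should work (you haven't shown your patterns, but judging by the names I'm guessing they don't cross line boundaries and aren't multiline or dotall patterns), but whether it is actually faster or slower will depend on a variety of factors.

Anyway, it's certainly worth a try to see if it helps. (And, while we're at it, I'll use a with statement to make sure you close the file, instead of leaking it as you do in the second part.)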

CurBeginA = BeginSearchDVar
CurEndinA = EndinSearchDVar
BeginSearchDVar = BeginTimeFirstEpoch
EndinSearchDVar = EndinTimeFirstEpoch    
matchesBegin, matchesEnd = None, None
TheTimeStamps = []
with open(logfile) as f:
    for line in f:
        if not matchesBegin:
            matchesBegin = re.search(str(BeginTimeFirstEpoch), line)
        if not matchesEnd:
            matchesEnd = re.search(str(EndinTimeFirstEpoch), line)
        TheTimeStamps.append(line.split(' ')[0][1:-1])
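
Since the outcome depends on your data, measuring is the only way to know. The following is a rough timing harness, not a benchmark of your actual script: the function names, the sample data, and the search values are all made up, and because everything runs in memory it ignores disk I/O, so the numbers on a real 480MB file may look quite different.

```python
import io
import re
import time

SAMPLE = "[1000] start\n[1001] work\n[1002] end\n" * 1000  # assumed line format

def two_pass(text, begin, end):
    """Mimic the question: whole-text searches plus a separate loop over the lines."""
    found = (re.search(begin, text) is not None, re.search(end, text) is not None)
    stamps = [line.split(' ')[0][1:-1] for line in io.StringIO(text)]
    return found, stamps

def single_pass(text, begin, end):
    """Mimic the single-pass version: search and collect in the same loop."""
    found_begin = found_end = False
    stamps = []
    for line in io.StringIO(text):
        if not found_begin:
            found_begin = re.search(begin, line) is not None
        if not found_end:
            found_end = re.search(end, line) is not None
        stamps.append(line.split(' ')[0][1:-1])
    return (found_begin, found_end), stamps

def timed(fn, *args):
    """Run fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

result_a, seconds_a = timed(two_pass, SAMPLE, "1000", "1002")
result_b, seconds_b = timed(single_pass, SAMPLE, "1000", "1002")
print("two passes: %.4fs, single pass: %.4fs" % (seconds_a, seconds_b))
```

Both functions should return the same result, so any difference in the timings comes from how the work is arranged rather than from what is computed.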

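There are a few other small changes you can make here that might help.

I don't know what BeginTimeFirstEpoch is, but the fact that you call str(BeginTimeFirstEpoch) implies that it isn't a regex pattern at all, but something like a datetime object or an int? And you don't really need the match object, you just need to know whether there is a match? If so, you can drop the regex and do a plain substring search, which is a bit faster: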

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch), str(EndinTimeFirstEpoch)
with …
    # …
    if not matchesBegin:
        matchesBegin = beginStr in line
    if not matchesEnd:
        matchesEnd = endStr in line
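
Speed aside, the two approaches are not always equivalent. If the stringified value happens to contain a regex metacharacter, re.search interprets it while the in operator does not. The sketch below uses a made-up float timestamp and made-up log lines to show the difference; whether your real values ever contain such characters is something only you can tell.

```python
import re

needle = str(1500000000.5)  # hypothetical value, becomes '1500000000.5'
line_exact = "[1500000000.5] job started"
line_other = "[1500000000x5] job started"  # differs only where the '.' would be

# Plain substring search treats every character literally.
substring_hits = (needle in line_exact, needle in line_other)

# As a regex, '.' matches any character, so the second line matches too.
regex_hits = (bool(re.search(needle, line_exact)),
              bool(re.search(needle, line_other)))

# re.escape makes the regex behave like the literal search again.
escaped = re.escape(needle)
escaped_hits = (bool(re.search(escaped, line_exact)),
                bool(re.search(escaped, line_other)))

print(substring_hits, regex_hits, escaped_hits)
```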

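If your search strings, timestamps, and so on are all pure ASCII, it may be faster to process the file in binary mode, decoding only the bits you need to store rather than everything: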

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch).encode(), str(EndinTimeFirstEpoch).encode()
with open(logfile, 'rb') as f:
    # …
    if not matchesBegin:
        matchesBegin = beginStr in line
    if not matchesEnd:
        matchesEnd = endStr in line
    TheTimeStamps.append(line.split(b' ')[0][1:-1].decode())
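
Put together as something you can run, the binary-mode idea might look like the sketch below. The function name scan_binary is hypothetical, io.BytesIO stands in for a file opened with 'rb', and the line format is again assumed. Note that in Python 3 the search strings must be bytes as well, because testing a str against a bytes line with in raises a TypeError.

```python
import io

def scan_binary(f, begin, end):
    """Single pass over a binary file object: two substring flags plus the first column."""
    begin_bytes, end_bytes = str(begin).encode(), str(end).encode()
    found_begin = found_end = False
    stamps = []
    for line in f:  # each line is a bytes object
        if not found_begin:
            found_begin = begin_bytes in line
        if not found_end:
            found_end = end_bytes in line
        # Decode only the small piece that gets stored.
        stamps.append(line.split(b' ')[0][1:-1].decode())
    return found_begin, found_end, stamps

# io.BytesIO stands in for open(logfile, 'rb'); the contents are made up.
log = io.BytesIO(b"[1000] start\n[1001] work\n[1002] end\n")
result = scan_binary(log, 1000, 1002)
print(result)
```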

Finally, I doubt that str.split is anywhere near the bottleneck in your code, but, just in case... why split on every space when we only want the first split?

TheTimeStamps.append(line.split(b' ', 1)[0][1:-1].decode())
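
To see what the second argument changes, here is a tiny sketch on a made-up line. With a maxsplit of 1 the line is cut at the first space only, so the rest of the message is never broken into pieces. bytes.partition is another option that stops at the first separator; I have not measured whether it is any faster for your data.

```python
line = b"[1000] start of a long message with many spaces\n"  # hypothetical log line

all_fields = line.split(b' ')     # splits at every space
first_only = line.split(b' ', 1)  # splits at the first space only
head, sep, tail = line.partition(b' ')  # also stops at the first space

stamp = first_only[0][1:-1].decode()
print(len(all_fields), len(first_only), stamp)
```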

python

optimization

file-handling