MemoryError in Python when searching a large file using mmap and re.findall

Question

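I want to implement a few lines of Python using re that first manipulate a string and then use that string as a regex search. I have strings with *'s in the middle, i.e. ab***cd, where the run of *'s can be any length. The aim is to run a regex search over a document and extract any lines that match the starting and ending characters, with any number of characters in between. That is, ab12345cd, abbbcd, and ab_fghfghfghcd would all be positive matches. Examples of negative matches: 1abcd, agcd, bb111cd.

I came up with the regex [\s\S]*? to substitute for the *'s. So I want to get from an example string of ab***cd to ^ab[\s\S]*?cd, which I will then use for a regex search of a document.

I then want to open the file with mmap, search through it using the regex, and save the matches to a file.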
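One thing worth checking about this substitution: [\s\S] also matches newlines, so with re.MULTILINE a match that starts on one line can run on into later lines. The sketch below is only an illustration, not part of the original code. The helper name wildcard_to_regex, the [^\n]* filler, and the trailing $ anchor are all assumptions about what "match a whole line" should mean here.

```python
import re

def wildcard_to_regex(raw, filler=r"[^\n]*"):
    """Turn 'ab***cd' into a per-line regex (hypothetical helper).

    Literal pieces are escaped with re.escape; each run of '*' becomes
    `filler`, which by default cannot cross a newline.
    """
    parts = [re.escape(piece) for piece in re.split(r"\*+", raw)]
    return "^" + filler.join(parts) + "$"

text = "ab1\nagcd\n"

# The [\s\S]*? version spans the line break and reports a match
print(re.findall(r"^ab[\s\S]*?cd", text, re.MULTILINE))

# The per-line version finds nothing, since neither line matches on its own
print(re.findall(wildcard_to_regex("ab***cd"), text, re.MULTILINE))
```

If the pattern is applied to one line at a time rather than to the whole file, the difference disappears, because a single line never contains an interior newline.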

import re
import mmap 

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search="^"+raw_str #add regex ^ newline operator
    search_rgx=re.sub(r'\*+',r'[\s\S]*?',search) #replace * with regex function

    #search file
    with open(list_txt, 'r+') as f: 
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx,encoding="utf-8"),data, re.MULTILINE)

    #save results
    f1 = open('results.txt', 'w+b')
    results_bin = b'\n'.join(results)
    f1.write(results_bin)
    f1.close()

    print("Found "+str(file_len("results.txt"))+" results")

searchFile("largelist.txt","ab**cd")

This works fine with a small file. However, when the file gets larger (1 GB of text), I get this error:

Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError

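Firstly, can anyone help optimize the code a little? Am I doing something seriously wrong? I used mmap because I knew I wanted to look at large files and I wanted to read the file line by line rather than all at once (which is why someone suggested mmap).

I've also been told to look at the pandas library for more data manipulation. Would pandas replace mmap?

Thanks for any help. As you can tell, I'm pretty new to Python, so I appreciate it.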

Answer 1

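How about this? In this situation, what you need is a list of all the lines, each represented as a string. The following emulates that, producing a list of strings: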

import io

longstring = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""

list_of_strings = io.StringIO(longstring).read().splitlines()
list_of_strings

Output

['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']

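This is the important part

import pandas as pd

s = pd.Series(list_of_strings)
# raw string so that \s and \S reach the regex engine unchanged
s[s.str.match(r'^ab[\s\S]*?cd')]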

Output

0         ab12345cd
1            abbbcd
2    ab_fghfghfghcd
dtype: object

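Edit2: Try this. (I don't see a reason why you'd want this as a function, but I've done it that way since that's what you did in the comments.)

import pandas as pd

def newsearch(filename):
    # note: this still reads the whole file into memory at once
    with open(filename, 'r', encoding="utf-8") as f:
        list_of_strings = f.read().splitlines()
    s = pd.Series(list_of_strings)
    s = s[s.str.match(r'^ab[\s\S]*?cd')]
    s.to_csv('output.txt', header=False, index=False)

newsearch('list.txt')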

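Chunk-based approach

import os
import pandas as pd

def newsearch(filename):
    outpath = 'output.txt'
    # start from a clean file because results are appended chunk by chunk
    if os.path.exists(outpath):
        os.remove(outpath)
    # sep='|' assumes '|' never occurs in the data, so each line is one column
    for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv(outpath, index=False, header=False, mode='a')

newsearch('list.txt')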

A dask approach

import dask.dataframe as dd

def newsearch(filename):
    # dask reads the file lazily in blocks of roughly 25 MB
    chunk = dd.read_csv(filename, header=None, blocksize=25e6)
    chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
    chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)

newsearch('list.txt')

Answer 2

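You are processing line by line, so you want to avoid accumulating data in memory. Regular file reads and writes should work well here. mmap is backed by virtual memory, but that has to turn into real memory as you read it. Accumulating the results in findall is also a memory hog. Try this as an alternative: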

import re

# buffer to 1Meg but any effect would be modest
MEG = 2**20

def searchFile(filename, raw_str):
    # extract start and end from "ab***cd"
    startswith, endswith = re.match(r"([^\*]+)\*+?([^\*]+)", raw_str).groups()
    with open(filename, buffering=MEG) as in_f, open("results.txt", "w", buffering=MEG) as out_f:
        for line in in_f:
            stripped = line.strip()
            if stripped.startswith(startswith) and stripped.endswith(endswith):
                out_f.write(line)

# write test file

test_txt = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""

want = """ab12345cd
abbbcd
ab_fghfghfghcd
"""

with open("test.txt", "w") as f:
    f.write(test_txt)

searchFile("test.txt", "ab**cd")

with open("results.txt") as f:
    result = f.read()
print(result == want)
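If you would rather keep the mmap, the list built by findall can be avoided by switching to re.finditer, which yields one match object at a time so each match can be written out immediately. This is only a sketch of that variation, not a drop-in replacement: the function name search_mmap and the pattern used in the usage note are assumptions, the pattern must be a bytes regex, and the input is assumed to use \n line endings.

```python
import mmap
import re

def search_mmap(in_path, out_path, pattern):
    """Write every match of `pattern` (a bytes regex) to out_path as it is found."""
    rgx = re.compile(pattern, re.MULTILINE)
    count = 0
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        # note: mmap.mmap raises ValueError if the file is empty
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # finditer is lazy, so only one match is held at a time
            for m in rgx.finditer(data):
                out.write(m.group(0) + b"\n")
                count += 1
    return count
```

A call might look like search_mmap("largelist.txt", "results.txt", rb"^ab[^\n]*cd$"). The [^\n]* filler keeps each match on a single line, which also bounds how large any one match can get.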

Answer 3

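I'm not sure what advantage you think you get from opening the input file with mmap, but since each string that must be matched is delimited by a newline (per your comment), I'd use the approach below (note that it is Python, but deliberately left as pseudocode):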

with open(input_file_path, "r") as input_file:
  with open(output_file_path, "x") as output_file:
    for line in input_file:
      if is_match(line):
        print(line, file=output_file)

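You may need to adjust the end parameter of the print function: each line read from the file already ends with a newline, so the default end would add a second one.

This way, results are written as they are generated, and you avoid holding a large results list in memory before writing it. Furthermore, you don't need to concern yourself with newlines in the pattern. Just check whether each line matches.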
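For reference, one way the pseudocode above could be filled in is sketched here. The function name search_file, the fullmatch-based is_match, and the use of "w" instead of "x" for the output mode (so that reruns overwrite rather than fail) are all assumptions made for illustration, not part of the original answer.

```python
import re

def search_file(input_file_path, output_file_path, pattern):
    """Copy every line that fully matches `pattern` to the output file."""
    rgx = re.compile(pattern)

    def is_match(line):
        # compare the line without its trailing newline against the whole pattern
        return rgx.fullmatch(line.rstrip("\r\n")) is not None

    count = 0
    with open(input_file_path, "r", encoding="utf-8") as input_file, \
         open(output_file_path, "w", encoding="utf-8") as output_file:
        for line in input_file:  # one line in memory at a time
            if is_match(line):
                output_file.write(line)  # line keeps its own newline
                count += 1
    return count
```

Called as search_file("largelist.txt", "results.txt", r"ab.*cd"), this holds only the current line in memory, so memory use should stay flat regardless of file size.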
