Removing a large multi-line string from a very large file

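I have a 10 GB text file, from which I want to find and delete a multi-line chunk. This chunk is given as another 10 MB text file, constituting a contiguous section that appears once in the large file and spans complete lines. Assuming I do not have enough memory to hold the whole 10 GB in memory, what would be the easiest way to do this in some scripting language?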

Example:

big.txt:

...

I have a 10 GB text file, from which I want to find and delete a multi-line chunk.

This chunk is given as another 10 MB text file,

constituting a contentious section appearing once in the large file and spanning complete lines.

Assuming I do not have enough memory to process the whole 10 GB in memory,

what would be the easiest way to do so in some scripting language?

...

chunk.txt:

This chunk is given as another 10 MB text file,

constituting a contentious section appearing once in the large file and spanning complete lines.

result.txt:

...

I have a 10 GB text file, from which I want to find and delete a multi-line chunk.

Assuming I do not have enough memory to process the whole 10 GB in memory,

what would be the easiest way to do so in some scripting language?

...

Following this comment, I implemented a Python script that uses mmap to solve my problem. It also works under more general conditions:

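  • Complete lines are not required
  • Handles multiple non-overlapping matches
  • Handles multiple chunk files, in order of decreasing file size
  • Works on bytes
  • The chunk itself can be large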
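
As a minimal illustration of these conditions (byte-level matching, several non-overlapping matches), here is a function-level sketch of the same find-and-shift idea. It is not the script itself: the name `delete_all` is hypothetical, the chunk is passed as an in-memory `bytes` object rather than a mapped file, and instead of calling `mmap.resize` it tracks the logical size and truncates the file once after the map is closed. I assume this is easier to run on platforms where resizing a map is unavailable.

```python
import mmap
import os


def delete_all(path, chunk):
    """Delete every non-overlapping occurrence of `chunk` (bytes) from the
    file at `path`, in place. Returns the number of occurrences removed.

    Hypothetical helper for illustration only.
    """
    if not chunk:
        raise ValueError('chunk must not be empty')
    size = os.path.getsize(path)
    if size == 0:
        return 0  # mmap cannot map an empty file
    count = 0
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0) as m:
            while True:
                # Search only the still-valid part of the map, from the end.
                start = m.rfind(chunk, 0, size)
                if start == -1:
                    break
                end = start + len(chunk)
                # Shift the bytes after the match left, over the match.
                m.move(start, end, size - end)
                size -= len(chunk)
                count += 1
            m.flush()
        # The map is closed here, so the stale tail can be cut off.
        f.truncate(size)
    return count
```

For example, calling `delete_all(path, b"XX")` on a file containing `aXXbXXc` leaves `abc` in the file and returns 2.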

Code:

"""Usage: python3 delchunk.py BIGFILE CHUNK_FILE_OR_FOLDER [OUTFILE]
Given a large file BIGFILE, delete all complete non-overlapping possibly large chunks given by CHUNK_FILE_OR_FOLDER
Multiple chunks will be deleted from the largest to the smallest
If OUTFILE is not given, result will be saved to BIGFILE.delchunk
"""


import mmap
import os
import shutil
import sys


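# Both BIGFILE and CHUNK_FILE_OR_FOLDER are required.
if len(sys.argv) < 3:
    print(__doc__)
    sys.exit(1)
bigfile_path, chunk_source = sys.argv[1], sys.argv[2]
output = sys.argv[3] if len(sys.argv) > 3 else bigfile_path + '.delchunk'
# Work on a copy so that BIGFILE stays untouched (unless OUTFILE is BIGFILE itself).
if bigfile_path != output:
    shutil.copy(bigfile_path, output)
if os.path.isdir(chunk_source):
    # Use every regular file in the folder as a chunk, largest first.
    paths = [os.path.join(chunk_source, name) for name in os.listdir(chunk_source)]
    chunks = sorted(filter(os.path.isfile, paths), key=os.path.getsize, reverse=True)
else:
    chunks = [chunk_source]
# Map the output file read-write; pages are loaded on demand, so it need not fit in RAM.
with open(output, 'r+b') as bigfile, mmap.mmap(bigfile.fileno(), 0) as bigmap:
    for chunk in chunks:
        with open(chunk, 'rb') as chunkfile, mmap.mmap(chunkfile.fileno(), 0, access=mmap.ACCESS_READ) as chunkmap:
            i = 0
            while True:
                # Searching from the end keeps the tail that has to be moved short.
                start = bigmap.rfind(chunkmap)
                if start == -1:
                    break
                i += 1
                end = start + len(chunkmap)
                print('Deleting chunk %s (%d) at %d:%d' % (chunk, i, start, end))
                # Shift everything after the match left over it, then cut off the stale tail.
                bigmap.move(start, end, len(bigmap) - end)
                # Note: mmap.resize() is not available on every platform.
                bigmap.resize(len(bigmap) - len(chunkmap))
            if not i:
                print('Chunk %s not found' % chunk)
            else:
                bigmap.flush()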

https://gist.github.com/eyaler/971efea29648af023e21902b9fa56f08
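
If the result is written to a new file anyway, one possible variant avoids copying the big file and then shrinking it in place: locate the chunk through read-only maps and stream everything around it into the output. The following is only a sketch under the assumptions of the original question (a non-empty chunk that occurs once). `copy_without_chunk` and `bufsize` are hypothetical names and not part of the script above.

```python
import mmap
import shutil


def copy_without_chunk(big_path, chunk_path, out_path, bufsize=1 << 20):
    """Copy big_path to out_path, skipping the first occurrence of the chunk.

    Returns the byte offset of the removed chunk, or -1 if it was not found
    (out_path is then a plain copy). Both input files must be non-empty,
    because mmap cannot map an empty file.
    """
    with open(big_path, 'rb') as big, open(chunk_path, 'rb') as chunkfile:
        # Read-only maps: pages are loaded on demand, nothing is modified.
        with mmap.mmap(big.fileno(), 0, access=mmap.ACCESS_READ) as bigmap, \
                mmap.mmap(chunkfile.fileno(), 0, access=mmap.ACCESS_READ) as chunkmap:
            start = bigmap.find(chunkmap)
            length = len(chunkmap)
        with open(out_path, 'wb') as out:
            big.seek(0)
            if start == -1:
                shutil.copyfileobj(big, out, bufsize)
                return -1
            # Copy the part before the chunk in bounded pieces.
            remaining = start
            while remaining > 0:
                data = big.read(min(bufsize, remaining))
                if not data:
                    break
                out.write(data)
                remaining -= len(data)
            # Skip the chunk and copy the rest of the file.
            big.seek(start + length)
            shutil.copyfileobj(big, out, bufsize)
    return start
```

Compared with the script, this sketch reads the big file once for the search and once for the copy, never modifies the input, and does not depend on `mmap.resize`. It handles only a single match per call.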