splitlines() and iterating over an opened file give different results

Question

I have files that sometimes contain weird end-of-line sequences such as \r\r\n. With this, it works the way I want:

with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')
with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'         
# b'def'

I would like to get the same result from a string. I thought about splitlines, but it does not give the same result:

print(b'abc\r\r\ndef'.splitlines()) # [b'abc', b'', b'def']

Even with keepends=True, the result is not the same.
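For reference, a quick check of what keepends=True returns for the same bytes. The names data, kept and wanted are only illustrative. The lone \r is still treated as a line boundary, so the pieces differ from what file iteration yields:

```python
data = b'abc\r\r\ndef'

# splitlines() treats '\r', '\n' and '\r\n' each as a line boundary,
# so the first '\r' ends a line and the following '\r\n' ends an empty one.
kept = data.splitlines(keepends=True)
print(kept)            # [b'abc\r', b'\r\n', b'def']

# What iterating over the file opened in 'rb' mode gives instead.
wanted = [b'abc\r\r\n', b'def']
print(kept == wanted)  # False
```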

Question: how can I get the same behaviour as for l in f using splitlines()?

Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232

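Note: I don't want to put everything in a BytesIO or StringIO, because that runs at about x0.5 the speed (already benchmarked); I want to keep a simple string. So this is not a duplicate of How do I wrap a string in a file in Python?.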

Answer 1

I would iterate like this:

text = 'abc\r\r\ndef'

# Split on the exact '\r\r\n' sequence; the line endings are dropped.
results = text.split('\r\r\n')

for r in results:
    print(r)

Answer 2

Here is a for l in f: solution:

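The key is the newline argument of the open call. According to the documentation for open(), when reading with newline='' line endings are recognised in universal-newlines mode but returned untranslated, and with any other legal value ('\n', '\r' or '\r\n') lines are terminated only by that string, again untranslated. When writing, newline='' or newline='\n' means no translation takes place.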

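So you should use newline='' when writing, to suppress newline translation, and newline='\n' when reading. This works as long as every line ends with zero or more '\r' characters followed by a '\n' character: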

with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')
with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))

Prints:

'abc\r\r\n'
'def'
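To see how the newline argument changes what iteration yields, here is a minimal sketch that reads the same content three ways. The helper read_lines and the temporary file are assumptions made only for this illustration:

```python
import os
import tempfile

def read_lines(path, newline):
    """Return the lines of a text file opened with the given newline mode."""
    with open(path, 'r', newline=newline) as f:
        return list(f)

# Write the sample without any newline translation.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', newline='') as f:
    f.write('abc\r\r\ndef')

try:
    translated = read_lines(path, None)   # universal newlines, endings become '\n'
    untranslated = read_lines(path, '')   # universal newlines, endings kept
    only_lf = read_lines(path, '\n')      # only '\n' terminates a line
finally:
    os.remove(path)

print(translated)    # ['abc\n', '\n', 'def']
print(untranslated)  # ['abc\r', '\r\n', 'def']
print(only_lf)       # ['abc\r\r\n', 'def']
```

Only newline='\n' keeps 'abc\r\r\n' together as one line, which matches the binary-mode result in the question.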

A quasi-splitlines solution:

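Strictly speaking this is not a splitlines solution: to handle arbitrary line endings you would have to use the regular-expression version of split, capture the line endings, and then reassemble the lines with their endings. Instead, this solution simply uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character: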

import re

input = '\nabc\r\r\ndef\nghi\r\njkl'

with open('test.txt', 'w', newline='') as f:
    f.write(input)
with open('test.txt', 'r', newline='') as f:
    text = f.read()
    lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
    for line in lines:
        print(repr(line))

Prints:

'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'


Answer 3

There are a few ways to do this, but none of them is especially fast.

If you want to keep the line endings, you might try the re module:

import re

text = 'abc\r\r\ndef'
lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)

If you need the endings and the file is really big, you may want to iterate instead:

for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
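Note that the pattern above groups consecutive line-ending characters, so a blank line is merged into the line before it. A minimal sketch of a variant that ends a line only at '\n', which is what iterating over a file opened in binary mode does. The helper name iter_lines and the pattern are assumptions for illustration, not part of the answer:

```python
import re

# Hypothetical helper, not from the answer: a line is everything up to and
# including the next b'\n', or the non-empty remainder at the end of the data.
_LINE_RE = re.compile(rb'[^\n]*\n|[^\n]+')

def iter_lines(data):
    """Yield lines from a bytes object, ending each only at b'\\n'."""
    for match in _LINE_RE.finditer(data):
        yield match.group()

print(list(iter_lines(b'abc\r\r\ndef')))  # [b'abc\r\r\n', b'def']
print(list(iter_lines(b'a\n\nb\n')))      # [b'a\n', b'\n', b'b\n']
```

Because finditer is lazy, this keeps only one line in memory at a time in addition to the source bytes.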

If you don't need the endings, you can do it much more easily:

lines = list(filter(None, text.splitlines()))

If you are only going to iterate over the results (or if you are using Python 2), you can omit the list() part:

for line in filter(None, text.splitlines()):
    pass # do stuff with line
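A small illustration of what the filter approach returns, using a made-up sample string. Be aware that it cannot tell the spurious empty string produced by '\r\r\n' from a genuinely blank line, so both are dropped:

```python
text = 'abc\r\r\ndef\n\nghi'

# splitlines() yields an empty string between '\r' and '\r\n' and another
# for the blank line; filter(None, ...) drops every empty string.
lines = list(filter(None, text.splitlines()))
print(lines)  # ['abc', 'def', 'ghi']
```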

Answer 4

Why don't you just split it:

input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n') 
print(result)

[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']

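You lose the trailing \n, which can be added back to every line later if you really need it. The last line needs a check to see whether it actually had one. Like this: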

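fixed = [bstr + b'\n' for bstr in result]
if input.endswith(b'\n'):
    fixed.pop()  # split() left a trailing b'' that is not a real line
else:
    fixed[-1] = fixed[-1][:-1]  # the last line had no newline
print(fixed)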

[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']

Another variation uses a generator. This way it is memory-friendly on huge inputs, and the syntax is similar to the original: for l in bin_split(input):

def bin_split(input_str):
    start = 0
    while start>=0 :
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start : found]
            start = found
        else:
            yield input_str[start:]
            break
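As a sanity check, here is a hedged restatement of the generator with a slightly different loop, written so that empty input yields nothing, compared against iteration over io.BytesIO. The BytesIO comparison is only there to verify the result, not a suggestion to use it in the hot path:

```python
import io

def bin_split(data):
    """Yield the b'\\n'-terminated lines of a bytes object, one at a time."""
    start = 0
    while start < len(data):
        end = data.find(b'\n', start) + 1  # index just past the next b'\n'
        if end == 0:                       # no newline left
            end = len(data)
        yield data[start:end]
        start = end

sample = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
print(list(bin_split(sample)))
# [b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
print(list(bin_split(sample)) == list(io.BytesIO(sample)))  # True
```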
