Python: restrict newline characters for readlines()

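I am trying to split text that uses a mix of newline characters: LF, CRLF and NEL. I need the best way to exclude NEL characters from the line splitting.

Is there an option to tell readlines() to exclude NEL when splitting lines? I could read() the whole file and match only LF and CRLF as split points in a loop.

Is there a better solution?

I open the file with codecs.open(), since it is a utf-8 text file.

When I use readlines(), it does split at the NEL character.

The file content is: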

"u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'"

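file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newlines support is enabled.

U+0085 NEXT LINE (NEL) is not recognized as a line separator in that context, and you don't need to do anything special to make file.readlines() ignore it.

Quoting the open() function documentation: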

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

And the universal newlines glossary entry:

A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.

Unfortunately, codecs.open() breaks this rule; the documentation only vaguely alludes to the specific codec being asked:

Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.

Instead of codecs.open(), use io.open() to open the file with the correct encoding, then process the lines one by one:

import io

with io.open(filename, encoding=correct_encoding) as f:
    lines = f.readlines()

io is the new I/O infrastructure that entirely replaces the Python 2 system in Python 3. It splits only on \n, \r and \r\n:

>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
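
If lines should end only at \n (so that CRLF endings still end a line, but a lone \r does not), the newline argument of io.open() can narrow the rule further. The following is a minimal sketch, not part of the original answer: the helper name read_lines_lf_only and the sample text are made up for illustration, and it assumes Python 2.7 or Python 3. With newline='\n', only \n terminates a line and the endings are returned untranslated.

```python
import io
import os
import tempfile


def read_lines_lf_only(path, encoding='utf8'):
    """Hypothetical helper: read lines that end only at LF (so CRLF too)."""
    # newline='\n' disables universal newlines: a lone '\r' no longer ends a
    # line, and line endings are returned exactly as stored in the file.
    with io.open(path, encoding=encoding, newline='\n') as f:
        return f.readlines()


# Sample text with NEL (U+0085), a lone CR, and CRLF line endings.
sample = u'Line 1 \x85 Line 1.1\r\nLine 2\rstill line 2\r\nLine 3\r\n'

fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(sample.encode('utf8'))
    lines = read_lines_lf_only(path)
finally:
    os.remove(path)

for line in lines:
    print(repr(line))
```

Each returned line still carries its \r\n ending, so call line.rstrip(u'\r\n') if the endings are not wanted.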

The codecs.open() result is caused by its use of str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method does not explain this; it claims to split only according to the universal newlines rules.
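
The difference is easy to see on text that is already decoded. The sketch below contrasts splitlines() with a regex split that treats only LF and CRLF as line breaks, which is roughly the read()-and-split approach mentioned in the question. The helper name split_lf_crlf is hypothetical, and the code assumes Python 2.7 or Python 3.

```python
import re

text = u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'

# splitlines() on a unicode string treats U+0085 (NEL) as a line boundary.
naive = text.splitlines()


def split_lf_crlf(s):
    """Hypothetical helper: split only on LF or CRLF, leaving NEL in place."""
    parts = re.split(r'\r?\n', s)
    # A trailing newline yields one empty string at the end; drop it.
    if parts and parts[-1] == u'':
        parts.pop()
    return parts


strict = split_lf_crlf(text)

print(naive)   # four pieces: the first line is cut at the NEL character
print(strict)  # three lines: the NEL stays inside the first line
```

This reads the whole text into memory first, so for very large files a line-by-line reader such as io.open() is likely the better fit.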

import re

# Python 2: binary mode splits only on '\n'; strip leftover '\r' from each line.
with open('file.csv', 'rb') as fh:
    f = [re.sub('\r', '', str(line)) for line in fh]

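This creates a list of strings in which the other \r characters do not act as line breaks. Each element of the list is one line of the file. I had a similar problem, and this worked for my csv. You may need to change the regular expression in the re.sub part to suit your needs.

Note: this removes the \r characters by replacing them with ''. I wanted to get rid of them, so it worked for me.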