Python: restrict newline characters for readlines()
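I am trying to split text that uses a mix of newline characters: LF, CRLF and NEL. I need the best way to keep NEL characters out of the picture.

Is there an option to instruct readlines() to exclude NEL when splitting lines? I could read() the whole file and match only the LF and CRLF split points in a loop.

Is there a better solution?

I open the files with codecs.open() because they are UTF-8 text files.

When I use readlines(), it does split at NEL characters. The file content is: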
"u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'"
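file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newline support is enabled.

U+0085 NEXT LINE (NEL) is not recognized as a line terminator in that context, and you don't need to do anything special to have file.readlines() ignore it.

Quoting the open() function documentation: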
Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.
and the universal newlines glossary entry:
A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.
Unfortunately, codecs.open() breaks this rule; the documentation vaguely alludes to the specific codec being asked:
Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.
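Instead of codecs.open(), use io.open() to open the file with the correct encoding, then process the lines one by one:

import io

# filename and correct_encoding are placeholders for your own values
with io.open(filename, encoding=correct_encoding) as f:
    lines = f.readlines()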
io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles only \n, \r and \r\n:
>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
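For readers on Python 3, the session above can be reproduced roughly as follows. This is a minimal sketch, not part of the original answer: the helper names read_lines_keep_nel and demo are hypothetical, and the temporary file stands in for /tmp/test.txt. It also shows the newline='' argument of io.open(), which still splits only on \n, \r and \r\n but returns the terminators untranslated.

```python
import io
import os
import tempfile


def read_lines_keep_nel(path, encoding="utf-8", keep_original_endings=False):
    """Read lines, splitting only on LF, CR and CRLF (never on U+0085)."""
    # newline=None (the default) translates every terminator to LF;
    # newline='' splits on the same terminators but leaves them untouched.
    newline = "" if keep_original_endings else None
    with io.open(path, encoding=encoding, newline=newline) as f:
        return f.readlines()


def demo():
    """Write the sample text from the question to a temp file and read it back."""
    text = u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n"
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(text.encode("utf-8"))
        translated = read_lines_keep_nel(path)
        original = read_lines_keep_nel(path, keep_original_endings=True)
        return translated, original
    finally:
        os.remove(path)
```

In both modes the NEL character stays inside the first line; only the treatment of the \r\n terminators differs.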
The codecs.open() result is due to that code using str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newlines rules.
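If the text is already in memory as a unicode string, one way to avoid the splitlines() behaviour is to split on the three universal newline terminators explicitly. The sketch below assumes Python 3 (or a unicode string on Python 2); split_universal is a hypothetical helper name, not a standard library function.

```python
import re

# CRLF must come first so that it is matched as a single terminator.
_UNIVERSAL_NEWLINE = re.compile(r"(\r\n|\n|\r)")


def split_universal(text, keepends=True):
    """Split text on CRLF, LF or CR only; NEL and other Unicode breaks are kept."""
    # With a capturing group, re.split returns:
    # [content, terminator, content, terminator, ..., trailing content]
    parts = _UNIVERSAL_NEWLINE.split(text)
    lines = []
    for i in range(0, len(parts) - 1, 2):
        body, end = parts[i], parts[i + 1]
        lines.append(body + end if keepends else body)
    if parts[-1]:
        # Text after the last terminator (a final line without a newline).
        lines.append(parts[-1])
    return lines


sample = u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n"
unicode_split = sample.splitlines(True)    # also breaks at U+0085
universal_split = split_universal(sample)  # breaks at CRLF, LF and CR only
```

Here unicode_split has four entries because of the NEL character, while universal_split has three.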
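import re
# Python 2: iterating a file opened in 'rb' mode splits on \n only,
# and each line is already a byte string, so str(line) is a no-op.
f = [re.sub('\r', '', str(line)) for line in open('file.csv', 'rb')]

This creates a list of strings that ignores the other \r characters. Each element of the list is one line of the file. I had a similar problem and this worked for my CSV. You may need to change the regular expression in the re.sub part to suit your needs.

Note: this removes the \r characters by replacing them with ''. I wanted to get rid of them, so it worked for me.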
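The same idea can be sketched without a regular expression and in a way that also works on Python 3. This is an illustration under a few assumptions: lines_without_cr is a hypothetical helper, the data is UTF-8 (where the byte 0x0A never occurs inside a multi-byte sequence, so decoding line by line is safe), and lone \r terminators do not need to count as line breaks, because binary iteration splits on \n only.

```python
import io


def lines_without_cr(binary_stream, encoding="utf-8"):
    """Decode each LF-terminated line and drop every carriage return."""
    # Iterating a binary stream splits on b'\n' only, so U+0085 is never a split point.
    return [raw.decode(encoding).replace(u"\r", u"") for raw in binary_stream]


# In-memory stand-in for a file opened with open('file.csv', 'rb').
sample_stream = io.BytesIO(u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n".encode("utf-8"))
cleaned = lines_without_cr(sample_stream)
```

With a real file, the call would look like lines_without_cr(open('file.csv', 'rb')), ideally inside a with block so the file is closed afterwards.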