Failing to parse Stack Exchange XML files
I am trying to parse the PostHistory.xml file from the Stack Exchange dump. My code looks like this:
import xml.etree.ElementTree as eTree

with open("PostHistory.xml", 'r') as xml_file:
    xml_tree = eTree.parse(xml_file)
But I get:
UnicodeDecodeError: 'utf-8' codec can't decode
bytes in position 1959-1960: invalid continuation byte
I can read text from the file like this:
with open("PostHistory.xml") as xml_file:
    a = xml_file.readline()
The file command returns this description of the file:
PostHistory.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text,
with very long lines, with CRLF line terminators
The first line of the file also confirms the UTF-8 encoding:
<?xml version="1.0" encoding="utf-8"?>
I tried adding the parameter encoding="utf-8-sig", but I got the same error again.
The file is 112 GB in size.
Am I missing something here?
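The actual bytes in the file may contradict the encoding specified in the XML declaration. (Merely setting the encoding in the XML declaration does not change the rest of the bytes in the file.)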
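To illustrate the point, here is a minimal sketch with made-up data: the sample bytes and the helper name can_decode are assumptions for demonstration, not taken from the dump. The declaration claims UTF-8, yet the bytes were written as Latin-1, so strict UTF-8 decoding fails.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: the declaration says UTF-8, but the bytes are Latin-1.
mislabeled = '<?xml version="1.0" encoding="utf-8"?><row Text="café" />'.encode("latin-1")

print(can_decode(mislabeled, "utf-8"))    # False
print(can_decode(mislabeled, "latin-1"))  # True
```

Latin-1 maps every possible byte to a character, so decoding with it never raises. The trade-off is that genuine multi-byte UTF-8 characters come out garbled (an "é" stored as UTF-8 reads back as "Ã©").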
You can try
open("PostHistory.xml", 'r', encoding="ISO-8859-1")
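but if this is data corruption rather than a file-wide encoding problem, you may have to roll up your sleeves and repair the bad bytes at positions 1959-1960.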
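Before repairing anything, it may help to find out where the bad bytes actually are and how many there are. The sketch below is one possible way to do that, not a definitive tool: the function name find_bad_lines and its limit parameter are hypothetical. It reads the file in binary mode and tries to decode each line as UTF-8, recording the absolute byte offset of the first failure on each bad line. The position printed in the original error message may be relative to the chunk being decoded rather than to the whole file, so an absolute offset is easier to act on.

```python
def find_bad_lines(path, limit=10):
    """Return up to `limit` tuples (line_number, file_offset, bad_bytes)."""
    bad = []
    offset = 0  # byte offset of the start of the current line
    with open(path, "rb") as f:  # binary mode, so nothing is decoded implicitly
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as e:
                # e.start and e.end index into `raw`, not into the whole file.
                # Only the first invalid sequence on each line is reported.
                bad.append((line_number, offset + e.start, raw[e.start:e.end]))
                if len(bad) >= limit:
                    break
            offset += len(raw)
    return bad
```

A handful of hits scattered through the file suggests local corruption; hits on almost every line containing non-ASCII text suggest the whole file is in a different encoding.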
See also:
- UnicodeDecodeError: 'utf-8' codec can't decode byte
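You can try something like this:
import xml.etree.ElementTree as eTree

with open(posts_path, "rb") as xml_file:  # binary mode, so each line is decoded explicitly
    for raw_line in xml_file:
        try:
            line = raw_line.decode("utf-8")
        except UnicodeDecodeError:
            # Dealing with corrupted encoded strings: latin-1 can decode any byte
            line = raw_line.decode("latin-1")
        if not line.lstrip().startswith("<row"):
            continue  # assumes one <row ... /> per line; skips the declaration and root tags
        xml_obj = eTree.fromstring(line)
So when a line contains bytes that are not valid UTF-8, that line is decoded as "latin-1" instead.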
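Since the file is 112 GB, building the whole tree with eTree.parse is unlikely to fit in memory even once the decoding error is out of the way. As an alternative, here is a rough sketch that streams the file with eTree.iterparse and opens it with errors="replace", so undecodable bytes become U+FFFD instead of raising. The function name iter_rows is hypothetical, and the code assumes the records are elements named row directly under the root element.

```python
import xml.etree.ElementTree as eTree


def iter_rows(path):
    """Yield the attributes of each <row> element as a dict, one at a time."""
    # utf-8-sig strips the BOM; errors="replace" substitutes U+FFFD for bad bytes.
    with open(path, "r", encoding="utf-8-sig", errors="replace") as xml_file:
        context = eTree.iterparse(xml_file, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == "row":  # "row" is an assumed tag name
                yield dict(elem.attrib)
                root.clear()  # drop processed rows so memory use stays flat
```

Usage would look like `for row in iter_rows("PostHistory.xml"): ...`. Replaced characters are lost for good, so this fits best when a few damaged strings are acceptable.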