Failing to parse Stack Exchange XML files

Question

I am trying to parse the PostHistory.xml file from the Stack Exchange dump. My code looks like this:

import xml.etree.ElementTree as eTree
with open("PostHistory.xml", 'r') as xml_file:
    xml_tree = eTree.parse(xml_file)

But I get:

UnicodeDecodeError: 'utf-8' codec can't decode 
bytes in position 1959-1960: invalid continuation byte

I can read the text of the file like this:

with open("PostHistory.xml") as xml_file:
     a = xml_file.readline()

The file command returns this description of the file:

PostHistory.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, 
with very long lines, with CRLF line terminators

The first line of the file also confirms the UTF-8 encoding:

<?xml version="1.0" encoding="utf-8"?>

I tried adding the parameter encoding="utf-8-sig", but I got the same error again.

The file size is 112 GB. Am I missing something here?

Answer 1

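The actual bytes in the file may contradict the encoding specified in the XML declaration. (Merely setting the encoding in the XML declaration does not change the rest of the bytes in the file.)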
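To make this concrete, here is a small sketch. The sample bytes are made up for illustration and do not come from the dump. It shows a document that declares UTF-8 yet contains a byte that is not valid UTF-8, and how the exception reports where the problem is.

```python
# Hypothetical sample: the declaration says UTF-8, but "é" is stored as the
# single Latin-1 byte 0xE9, which is not a valid UTF-8 sequence here.
data = b'<?xml version="1.0" encoding="utf-8"?><row Text="caf\xe9 table" />'

try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    # Byte offsets of the offending sequence within `data`.
    bad_start, bad_end = err.start, err.end

# Latin-1 maps every byte value to a character, so this decode cannot fail,
# although it only gives the right text if the bytes really are Latin-1.
text = data.decode("latin-1")
```

The declaration is only a label; decoding succeeds or fails based on the bytes that are actually stored.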

You could try

open("PostHistory.xml", 'r', encoding="ISO-8859-1")

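But if the problem is data corruption rather than a file-wide encoding issue, you may need to roll up your sleeves and repair the bad bytes at positions 1959-1960.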
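Before repairing anything, it can help to look at the raw bytes near the reported position. The helper below is a minimal sketch; the name bytes_around and its context parameter are made up for illustration. It also assumes that the position in the error message equals the offset in the file, which only holds if the failure happened in the first chunk that was decoded.

```python
def bytes_around(path, offset, context=20):
    """Return the raw bytes within `context` bytes of `offset` in the file."""
    start = max(0, offset - context)
    # Binary mode performs no decoding, so it cannot raise UnicodeDecodeError.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(offset - start + context)

# Example call (not executed here):
# print(bytes_around("PostHistory.xml", 1959))
```

Seeking reads only a few dozen bytes, so this stays cheap even on a very large file.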

See also:

UnicodeDecodeError: 'utf-8' codec can't decode byte

Answer 2

You could try something like this:

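    import xml.etree.ElementTree as eTree

    with open(posts_path, "rb") as xml_file:  # binary mode: decode each line ourselves
        for raw_line in xml_file:
            try:
                line = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                # Dealing with corrupted encoded strings: latin-1 accepts any byte
                line = raw_line.decode("latin-1")
            # Assumes each record is a <row ... /> element on its own line;
            # this skips the XML declaration and the root element's tags.
            if line.lstrip().startswith("<row"):
                xml_obj = eTree.fromstring(line)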

So when a line contains invalid characters, you fall back to latin-1 for that line.
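A different option, which is not what this answer proposes, is to let Python substitute the undecodable bytes and stream the file with iterparse so that a very large file is not loaded at once. This is only a sketch: the function name iter_rows is made up, and it assumes the records are row elements and that the bad bytes sit inside text rather than inside markup.

```python
import xml.etree.ElementTree as eTree

def iter_rows(path):
    """Yield the attributes of each <row> element, one element at a time."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising;
    # utf-8-sig also drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig", errors="replace") as xml_file:
        for _event, elem in eTree.iterparse(xml_file, events=("end",)):
            if elem.tag == "row":
                yield dict(elem.attrib)  # copy before the element is cleared
            elem.clear()  # drop attributes and children already processed
```

Replaced characters are lost for good, so this suits cases where a few damaged characters in the text are acceptable.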

xml

elementtree

xml-parsing

python-3.x