How to save a list of lxml.etree._ElementTree to file

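I've run into an annoying problem with the lxml library and can't figure out how to solve it.

I have a list of lxml.etree._ElementTree trees and a list of lxml.html.HtmlElement elements that belong to those trees. The XPath of each element is stored in a list called paths.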

# True where the path still matches at least one element in its tree
element_found = [len(tree.xpath(path)) > 0 for tree, path in zip(trees, paths)]
print(element_found.count(False))  # == 0
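
For context, here is a minimal sketch of how such a trees/paths pair can be built and checked. The sample markup is made up, and generating each path with getpath() is an assumption about how the paths were produced, not something stated in the question.

```python
import lxml.html

# Made-up sample document standing in for one of the real pages.
html = "<html><body><div><p>first</p><p>second</p></div></body></html>"

root = lxml.html.fromstring(html)   # lxml.html.HtmlElement
tree = root.getroottree()           # lxml.etree._ElementTree
element = root.xpath("//p")[1]      # the element to find again later

# Assumption: the stored path is the absolute XPath reported by getpath().
path = tree.getpath(element)

trees, paths = [tree], [path]
element_found = [len(t.xpath(p)) > 0 for t, p in zip(trees, paths)]
print(path, element_found.count(False))
```

Because getpath() returns a purely positional path, any change in nesting or sibling order after a save/load round trip is enough to make the lookup fail.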

The problem appears when I try to save the paths and trees so that I can restore this state later:

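import lxml.etree
import lxml.html
import pandas as pd
from ast import literal_eval

# tostring() returns bytes, so the CSV ends up storing their repr: "b'<html>...'"
trees_to_save = [{'tree': lxml.etree.tostring(tree, pretty_print=True)} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('trees.csv')

html_parser = lxml.html.HTMLParser(encoding='utf-8')

trees_from_file = pd.read_csv('trees.csv')
# literal_eval() turns the stored repr back into bytes before re-parsing
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: lxml.etree.HTML(literal_eval(x), html_parser).getroottree())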
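
The literal_eval call hints at what actually lands in the CSV: the text form of a bytes object rather than the markup itself. The sketch below illustrates this without pandas. The sample markup is made up, and using str() to mimic what a CSV writer does with a bytes cell is my assumption.

```python
import ast
import lxml.etree
import lxml.html

root = lxml.html.fromstring("<html><body><p>hello</p></body></html>")
tree = root.getroottree()

raw = lxml.etree.tostring(tree)       # bytes
stored = str(raw)                     # text form of the bytes, e.g. "b'<!DOCTYPE ...'"
restored = ast.literal_eval(stored)   # back to the original bytes

# Possible alternative: ask for text directly, so no repr/literal_eval detour is needed.
as_text = lxml.etree.tostring(tree, encoding="unicode", method="html")
print(type(raw).__name__, stored[:2], type(as_text).__name__)
```

Storing as_text keeps the CSV cell readable and removes one conversion step in which the markup could be altered.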

Then I run the same test:

# Iterate over the 'tree' column; iterating the DataFrame itself yields column names
element_found = [len(tree.xpath(path)) > 0 for tree, path in zip(trees_from_file['tree'], paths)]
print(element_found.count(False))  # == 6 (out of 12k)
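
With only a handful of failures out of 12k, it may help to list which trees fail so they can be inspected individually. A small sketch with made-up documents, where the second path is deliberately stale to simulate a failure:

```python
import lxml.html

# Made-up stand-ins for the reloaded trees and the saved paths.
docs = [
    "<html><body><div><p>a</p></div></body></html>",
    "<html><body><p>b</p></body></html>",
]
trees_reloaded = [lxml.html.fromstring(d).getroottree() for d in docs]
paths = ["/html/body/div/p", "/html/body/div/p"]  # second path is deliberately stale

element_found = [len(t.xpath(p)) > 0 for t, p in zip(trees_reloaded, paths)]

# Indices of the trees whose path no longer matches, for closer inspection.
failed = [i for i, ok in enumerate(element_found) if not ok]
for i in failed:
    print(i, paths[i])
```

Comparing the serialized markup of a failing tree before and after the round trip usually shows where the structure changed.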

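Normally I would expect every path to be found, so there is clearly a problem with the to/from-string methods or with how I save the trees. I have tried various approaches in the lxml library, such as tree.write instead of tostring, and plain .encode('utf-8') instead of literal_eval, to no avail. I tried with and without pretty_print, and etree.fromstring() gave the same result...

Worryingly, this also raises an XML syntax error: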

trees = [etree.fromstring(etree.tostring(t)) for t in trees]
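
The sketch below does not reproduce that exact error, since its cause depends on the real pages. It only illustrates two facts that are likely relevant: tostring() defaults to XML serialization, which writes HTML elements differently from method="html", and etree.fromstring() is a strict XML parser that rejects ordinary HTML markup. The sample markup is made up.

```python
import lxml.etree
import lxml.html

root = lxml.html.fromstring("<html><body><p>line one<br>line two</p></body></html>")

as_xml = lxml.etree.tostring(root)                  # default method="xml"
as_html = lxml.etree.tostring(root, method="html")  # HTML serialization

print(as_xml)
print(as_html)

# The strict XML parser accepts the XML form but not the HTML form.
xml_root = lxml.etree.fromstring(as_xml)
try:
    lxml.etree.fromstring(as_html)
    html_form_parses_as_xml = True
except lxml.etree.XMLSyntaxError:
    html_form_parses_as_xml = False
print(html_form_parses_as_xml)
```

Mixing the two worlds (XML serialization on the way out, an HTML parser on the way in, or the reverse) is a plausible source of both the syntax error and the shifted paths.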

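I'm at a bit of a loss as to how to save these trees properly...

OK, after trying everything I could find, I worked out how to get this working. I needed to serialize the trees with method='html' and read them back with parse: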

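import io
import lxml.etree
import pandas as pd

# Serialize as HTML and decode to str so the CSV holds plain markup, not a b'...' repr
trees_to_save = [{'tree': lxml.etree.tostring(tree, encoding='utf-8', method='html').decode('utf-8')} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('location_trees.csv', index=False)

trees_from_file = pd.read_csv('location_trees.csv')
html_parser = lxml.etree.HTMLParser(encoding='utf-8')
# parse() expects a file name or file-like object, so wrap the markup in BytesIO
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: lxml.etree.parse(io.BytesIO(x.encode('utf-8')), parser=html_parser))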
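
To check the round trip in isolation, here is a minimal in-memory sketch that leaves pandas out. The sample markup is made up, and the path is generated with getpath(), which is an assumption about how the original paths were built.

```python
import io
import lxml.etree
import lxml.html

html = "<html><body><div><p>first</p><p>second<br>line</p></div></body></html>"
tree = lxml.html.fromstring(html).getroottree()
path = tree.getpath(tree.xpath("//p")[1])

# Serialize as HTML text, the form that would go into the CSV cell.
cell = lxml.etree.tostring(tree, encoding="utf-8", method="html").decode("utf-8")

# parse() wants a file name or file-like object, hence the BytesIO wrapper.
parser = lxml.etree.HTMLParser(encoding="utf-8")
restored = lxml.etree.parse(io.BytesIO(cell.encode("utf-8")), parser=parser)

print(len(restored.xpath(path)) > 0)
```

Running the same check over every tree before writing the CSV would confirm that the HTML serializer and HTML parser agree on the structure for the full data set.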