How to get unresolved entities from HTML attributes using Python and lxml

Question

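When parsing HTML with python/lxml, I want to retrieve the actual attribute text of an HTML element, but instead I get the attribute text with its entities resolved. That is, if the actual attribute reads this &amp; that, I get back this & that.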

Is there a way to get the unresolved attribute value? Here is some sample code that shows my problem, using Python 2.7 and lxml 3.2.1:

>>> from lxml import etree
>>> s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(s, parser=parser)
>>> anc = tree.xpath('//a')
>>> a = anc[0]
>>> a.get('alt')
'hi & there'

>>> a.attrib.get('alt')
'hi & there'

>>> etree.tostring(a)
'<a alt="hi &amp; there">a link</a>'

I would like to get the actual string hi &amp; there.

Answer 1

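An unescaped character is invalid in HTML, and the HTML abstraction model (lxml.etree in this case) only works with valid HTML. So once the source HTML has been loaded into the object model, there is no notion of an unescaped character.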

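Given an unescaped character in the HTML source, a parser will either fail completely or try to repair the source automatically. lxml.etree.HTMLParser appears to fall into the latter category. Demo: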

from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# The source is automatically escaped. Output:
# <div>hi &amp; there</div>

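And I believe the HTML tree model does not retain information about the original HTML source; it keeps only the repaired, valid form. So at this point, all we can see is that every such character is escaped on output.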
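To illustrate the point that the tree does not remember how the source spelled a character, here is a minimal sketch for Python 3 with lxml installed. The helper name parsed_alt and the three sample documents are made up for illustration; they are not part of the question.

```python
from lxml import etree


def parsed_alt(source):
    """Parse an HTML string and return the 'alt' value of the first <a>."""
    parser = etree.HTMLParser()
    tree = etree.fromstring(source, parser=parser)
    return tree.xpath('//a')[0].get('alt')


# Three different source spellings of the same ampersand character.
variants = [
    '<html><body><a alt="hi &amp; there">a link</a></body></html>',
    '<html><body><a alt="hi &#38; there">a link</a></body></html>',
    '<html><body><a alt="hi &#x26; there">a link</a></body></html>',
]

values = [parsed_alt(v) for v in variants]

# If the parser resolves all three references to the same character,
# the set of parsed values collapses to a single entry.
print(set(values))
```

Since all three sources should yield the same attribute value, nothing in the parsed tree can tell you which spelling the original document used.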

That said, how about using cgi.escape() to get the escaped entities back! :p

# ...continuing the demo code above:
import cgi

print(t.xpath('//div')[0].text)
# hi & there
# escape the element's text (a string), not the element itself
print(cgi.escape(t.xpath('//div')[0].text))
# hi &amp; there
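Note that cgi.escape() was removed in Python 3.8; html.escape() from the standard library is its replacement. The following is a minimal sketch of the same re-escaping idea on Python 3, applied to an attribute value. The helper name reescape_attr is hypothetical. Keep in mind that this produces a canonical escaped form, which is not necessarily the exact text that appeared in the original source.

```python
import html


def reescape_attr(value):
    """Re-escape a parsed attribute value for use inside a quoted attribute.

    quote=True also escapes double and single quotes, which matters for
    attribute values but not for element text.
    """
    return html.escape(value, quote=True)


# 'hi & there' is what a.get('alt') returns in the question's example.
escaped = reescape_attr('hi & there')
print(escaped)  # hi &amp; there
```

With an lxml element a, as in the question, the call would be reescape_attr(a.get('alt')).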
