Getting behaviour like Python 3 html unescape in Python 2

Question

I have run into a problem: Python 2 (2.7.13) does not seem to define all of the HTML entities needed to unescape every entity.

For example, when running this script:

# test_unescape.py

from six.moves.html_parser import HTMLParser
h = HTMLParser()
# Print in a tuple for clarity
print((h.unescape('&pound;&lt;&Tab;&NewLine;&Colon;'),))

Depending on the Python version, you get different results.

Python 2:

$ python test_unescape.py
(u'\xa3<&Tab;&NewLine;&Colon;',)

Tab, NewLine and Colon are left escaped.

Python 3:

$ python3 test_unescape.py
('£<\t\n∷',)

Everything is unescaped.

I am also not sure why there appear to be two colons in the Python 3 output.

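Any workaround that gives the Python 3 behaviour, or an equivalent in Python 2, without having to define all the missing entities by hand (and then maintain that list as new entities are added) would be much appreciated.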
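For context on the difference described above, here is a small check that can be run on Python 3. It is only a sketch, and it assumes that Python 2's `HTMLParser.unescape` draws on the HTML 4 table `name2codepoint`, while Python 3 uses the larger `html5` table from `html.entities`. It also looks up what `&Colon;` maps to.

```python
import unicodedata
from html.entities import html5, name2codepoint

# Compare the HTML 4 table (name2codepoint) with the HTML5 table (html5).
# Keys in html5 include the trailing ';', keys in name2codepoint do not.
for name in ('pound', 'lt', 'Tab', 'NewLine', 'Colon'):
    print(name, name in name2codepoint, name + ';' in html5)

# Inspect the replacement text for &Colon;.
colon = html5['Colon;']
print(len(colon), hex(ord(colon)), unicodedata.name(colon))
```

If the assumption holds, the HTML5-only names (`Tab`, `NewLine`, `Colon`) are the ones Python 2 leaves alone, and `&Colon;` is a single character, U+2237 PROPORTION, rather than two separate colons.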

Answer 1

I found the answer to my problem in this library (html5lib).

More specifically, in the full list of entities located at html5lib.constants.entities:

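from html5lib.constants import entities

def unescape_custom(s):
    # Keys are entity names without the leading '&'.
    for e in entities:
        # Some names are listed without the trailing ';'; normalise them
        # so that only the ';'-terminated form is replaced.
        if e[-1] != ';':
            e += ';'
        s = s.replace('&{}'.format(e), entities[e])
    return s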

Results:

# Python 2
>>> print((unescape_custom("&pound;&lt;&Tab;&NewLine;&Colon;"),))
(u'\xa3<\t\n\u2237',)
# Python 3
>>> print((unescape_custom("&pound;&lt;&Tab;&NewLine;&Colon;"),))
('£<\t\n∷',)
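One caveat with chaining `str.replace` calls as above: text produced by one replacement can be picked up by a later one, so an input such as `&amp;lt;` may, depending on iteration order, end up as `<` instead of `&lt;`. A possible variation is to replace every entity in a single regex pass. The sketch below is not part of html5lib. The `ENTITIES` dict is a hypothetical stand-in with the same shape as `html5lib.constants.entities`, used so the example runs on its own, and `unescape_single_pass` is an illustrative name.

```python
import re

# Hypothetical stand-in with the same shape as html5lib.constants.entities
# (name without the leading '&' -> replacement text). In real use, pass
# the html5lib table instead.
ENTITIES = {
    u'amp;': u'&',
    u'lt;': u'<',
    u'pound;': u'\xa3',
    u'Tab;': u'\t',
    u'NewLine;': u'\n',
    u'Colon;': u'\u2237',
}

# A candidate named entity: '&', a letter, more letters or digits, then ';'.
_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*;)')

def unescape_single_pass(s, table=ENTITIES):
    """Replace named entities in one pass; unknown names are left as-is."""
    def repl(match):
        # group(1) is the name including ';', group(0) is the whole match.
        return table.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(repl, s)

print((unescape_single_pass(u'&pound;&lt;&Tab;&NewLine;&Colon;'),))
print((unescape_single_pass(u'&amp;lt;'),))
```

This only handles named entities that end in `;`. Numeric references such as `&#163;` and the legacy names without a semicolon are left untouched and would need extra handling.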
