HTMLParser and BeautifulSoup not decoding HTML entities correctly

Question

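I am trying to decode HTML entities in a piece of HTML source code using HTMLParser and BeautifulSoup.

However, neither seems to do the whole job: they do not decode the slashes.

My Python version is 2.7.11 and my BeautifulSoup version is 3.2.1.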

print 'ORIGINAL STRING: %s \n' % original_url_string

#clean up
try:
    # Python 2.6-2.7
    from HTMLParser import HTMLParser
except ImportError:
    # Python 3
    from html.parser import HTMLParser

h = HTMLParser()
url_string = h.unescape(original_url_string)

print 'CLEANED WITH html.parser: %s \n' % url_string

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x import path

decoded = BeautifulSoup(original_url_string, convertEntities=BeautifulSoup.HTML_ENTITIES)

print 'CLEANED WITH BeautifulSoup: %s \n' % decoded.contents

This gives me the following output:

ORIGINAL STRING: api.soundcloud.com%2Ftracks%2F277561480&#038;show_artwork=true&#038;maxwidth=1050&#038;maxheight=1000 

CLEANED WITH html.parser: api.soundcloud.com%2Ftracks%2F277561480&show_artwork=true&maxwidth=1050&maxheight=1000 

CLEANED WITH BeautifulSoup: [u'api.soundcloud.com%2Ftracks%2F277561480&show_artwork=true&maxwidth=1050&maxheight=1000']

What am I missing here?

Should I decode the whole HTML page before extracting the URL?

Is there a better way to do this in Python?

Answer 1

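Are you trying to decode the HTML entities in the URL, or the slashes in the URL?

If you are trying to decode the slashes, note that they are not HTML entities. `%2F` is a percent-encoded (URL-encoded) character, so an HTML entity decoder leaves it alone.

urllib has the function you need: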

>>> import urllib
>>> urllib.unquote(original_url_string)
'api.soundcloud.com/tracks/277561480&#038;show_artwork=true&#038;maxwidth=1050&#038;maxheight=1000'
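Since the string carries two independent layers of encoding, it may help to see each decoder applied on its own and then both together. The following is only a sketch and assumes Python 3, where the functions live in `html` and `urllib.parse`; on Python 2.7 the equivalents are `HTMLParser().unescape` and `urllib.unquote`, as used above. The variable names are illustrative.

```python
from html import unescape          # HTML entity decoding (Python 3.4+)
from urllib.parse import unquote   # percent-decoding (Python 3)

raw = (
    "api.soundcloud.com%2Ftracks%2F277561480"
    "&#038;show_artwork=true&#038;maxwidth=1050&#038;maxheight=1000"
)

# Only the HTML entities are decoded; %2F is left untouched.
entities_only = unescape(raw)

# Only the percent-encoding is decoded; &#038; is left untouched.
percent_only = unquote(raw)

# Undo the HTML layer first, then the URL layer.
both = unquote(unescape(raw))

print(entities_only)
print(percent_only)
print(both)
```

Decoding the entities first seems the safer order here, because the entity encoding was applied last, when the URL was embedded in the HTML.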

If you want to decode the HTML itself, you first have to fetch it with a package such as requests or urllib.
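On the question of decoding the whole page first: if you run the fetched HTML through a parser, attribute values normally come out with their entities already decoded, so only the percent-encoding is left to handle. Below is a rough sketch assuming Python 3 and the standard-library `html.parser`. The `page` markup and the `SrcCollector` class are hypothetical, made up for illustration; a literal string stands in for the page you would fetch.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class SrcCollector(HTMLParser):
    """Collect the src attribute of every iframe tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # replaced character references in the values.
        if tag == "iframe":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical markup standing in for the fetched page.
page = (
    '<iframe src="api.soundcloud.com%2Ftracks%2F277561480'
    '&#038;show_artwork=true"></iframe>'
)

collector = SrcCollector()
collector.feed(page)
collector.close()

# The entities are gone at this point; percent-decoding is the remaining step.
decoded_sources = [unquote(src) for src in collector.sources]
print(decoded_sources)
```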

python

beautifulsoup

html-entities

html-parser