Why does BeautifulSoup reformat my XML?
I do the following:
from BeautifulSoup import *
html = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(html)
soup.contents
As a result I get:
[<body><b>In Body</b><b>Second level</b></body>]
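This looks strange to me because I do not see my original XML. Originally I had a <b> tag that contains some text (In Body) and then contains another <b> tag. However, BeautifulSoup "thinks" that I had a <b> tag and that after it (after it is closed) I have another <b> tag. So, the tags are not perceived as nested into each other. Why is that?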
ADDED
For those who complain about the validity of the HTML in my example, I made the following example:
xml = u'<aaa><bbb>In Body<bbb>Second level</bbb></bbb></aaa>'
soup = BeautifulSoup(xml)
soup.contents
which returns:
[<aaa><bbb>In Body</bbb><bbb>Second level</bbb></aaa>]
ADDED 2
If I use:
xml = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, ['lxml', 'xml'])
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1522, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1147, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1189, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 138, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.7/sgmllib.py", line 296, in parse_starttag
self.finish_starttag(tag, attrs)
File "/usr/lib/python2.7/sgmllib.py", line 338, in finish_starttag
self.unknown_starttag(tag, attrs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1344, in unknown_starttag
and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
AttributeError: 'list' object has no attribute 'text'
Note that you are using an obsolete package, BeautifulSoup:
This package is OBSOLETE. It has been replaced by the beautifulsoup4 package. You should use Beautiful Soup 4 for all new projects
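BeautifulSoup 3 contained some XML parsing features (BeautifulStoneSoup), and these indeed did not understand the same tag nested inside itself (as 7stud points out); so for all XML parsing needs it should be replaced completely by BeautifulSoup 4. Note that the two packages can even coexist in one application: BeautifulSoup.BeautifulSoup for BS3, bs4.BeautifulSoup for BS4.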
BeautifulSoup 4 parses with HTML rules by default; you need to tell it explicitly to use XML (which requires lxml to be installed). So here is an example with BeautifulSoup 4 (beautifulsoup4 on PyPI):
>>> import bs4
>>> from bs4 import BeautifulSoup
>>> xml = u'<body><b>In Body<b>Second level</b></b></body>'
>>> soup = BeautifulSoup(xml, 'xml')
>>> soup.contents
[<body><b>In Body<b>Second level</b></b></body>]
>>> bs4.__version__
'4.1.3'
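Note that the document must then be well-formed XML; there is no leniency.
If you do not use the 'xml' argument, you will get a misparsed document: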
>>> bs4.BeautifulSoup('<p><p></p></p>')
<html><body><p></p><p></p></body></html>
and with the 'xml' argument:
>>> bs4.BeautifulSoup('<p><p></p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p/></p>
So, the tags are not perceived as nested into each other. Why is that?
According to the comments in the BeautifulSoup source code:
Tag nesting rules:
Most tags can't be nested at all. For instance, the occurance of a <p> tag should implicitly close the previous <p> tag.
<p>Para1<p>Para2
should be transformed into:
<p>Para1</p><p>Para2
The source code then specifies several lists of tag names that, according to the HTML standard, are allowed to nest inside themselves, and <b> is not one of them.
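The rule in that comment can be sketched with a small stack-based rewriter. This is only an illustration of the idea, not BeautifulSoup's actual code: the function name `reflow`, the regular expression, and the contents of the `NESTABLE` set are all assumptions made up for this sketch, and it handles only attribute-free tags.

```python
import re

# Hypothetical, simplified stand-in for the parser's nesting rule.
# NESTABLE is an illustrative subset chosen for this sketch only.
NESTABLE = {"div", "blockquote", "span"}

# Matches <tag>, </tag>, or a run of text (no attributes supported).
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")

def reflow(markup, nestable=NESTABLE):
    """Re-emit markup, implicitly closing a tag when the same tag reopens."""
    out = []
    stack = []  # names of the currently open tags
    for closing, name, text in TOKEN.findall(markup):
        if text:
            out.append(text)
        elif closing:
            # Skip end tags whose element was already closed implicitly.
            if name in stack:
                while stack:
                    top = stack.pop()
                    out.append("</%s>" % top)
                    if top == name:
                        break
        else:
            # A non-nestable tag that reopens closes its previous occurrence.
            if name not in nestable and stack and stack[-1] == name:
                out.append("</%s>" % stack.pop())
            stack.append(name)
            out.append("<%s>" % name)
    # Close anything still open at the end of the input.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)

markup = "<body><b>In Body<b>Second level</b></b></body>"
print(reflow(markup))                  # <body><b>In Body</b><b>Second level</b></body>
print(reflow(markup, nestable={"b"}))  # <body><b>In Body<b>Second level</b></b></body>
```

With <b> treated as non-nestable, the second <b> closes the first and the now-orphaned second </b> is dropped, which reproduces the flattened output from the question. Declaring <b> nestable leaves the markup unchanged.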
If I use:
xml = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, ['lxml', 'xml'])
I get:
AttributeError: 'list' object has no attribute 'text'
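You get that error because you cannot pass a list as that argument to BeautifulSoup(): as the traceback shows, BeautifulSoup 3 treats the second positional argument as parseOnlyThese and tries to read its .text attribute.
To alert BeautifulSoup that you are not parsing html, you need to use BeautifulStoneSoup(). Unfortunately, my tests show that BeautifulStoneSoup() produces the same xml, so it appears that BeautifulStoneSoup() applies similar nesting rules to your <b> tag.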
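If you are not constrained to BeautifulSoup 3, you should use either lxml or BeautifulSoup 4. Many consider lxml the superior package (for example, it is faster and you can use xpaths), but it can be difficult to install. So I suggest you try to install lxml, and if that works, great. Otherwise, install BeautifulSoup 4.
I have used BeautifulSoup for many years and I prefer it; but I also use lxml when I want to search a document with xpaths.
lxml example: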
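from lxml import etree

xml = '<body><b>In Body<b>Second level</b></b></body>'
tree = etree.fromstring(xml)  # returns the root <body> element
# encoding='unicode' makes tostring() return text rather than bytes
print(etree.tostring(tree, encoding='unicode'))
matching_tags = tree.xpath('/body/b/b')  # absolute XPath to the nested <b>
inner_b_tag = matching_tags[0]
print(inner_b_tag.text)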
--output:--
<body><b>In Body<b>Second level</b></b></body>
Second level
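If lxml turns out to be hard to install, the same check can be sketched with Python's standard-library xml.etree.ElementTree, which is also a real XML parser and so keeps the nesting. This is an aside and not something tested in the answer above; note that ElementTree supports only a limited subset of XPath, so the sketch uses plain find() calls.

```python
import xml.etree.ElementTree as ET

xml = '<body><b>In Body<b>Second level</b></b></body>'
root = ET.fromstring(xml)    # root is the <body> element
outer_b = root.find('b')     # first <b> child of <body>
inner_b = outer_b.find('b')  # the <b> nested inside it

# encoding='unicode' returns text and omits the XML declaration
print(ET.tostring(root, encoding='unicode'))
print(outer_b.text)
print(inner_b.text)
```

The serialized tree is identical to the input string, and the two text values are In Body and Second level.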
bs4 example:
from bs4 import BeautifulSoup
xml = '<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, 'xml') #In BeautifulSoup 4, you pass a second argument to BeautifulSoup() to indicate that you are parsing xml.
print(soup)
body = soup.find('body')
inner_b_tag = body.b.b
print(inner_b_tag.string)
--output:--
<?xml version="1.0" encoding="utf-8"?>
<body><b>In Body<b>Second level</b></b></body>
Second level