BeautifulSoup4 find_all() behaves strangely after extract() or decompose()

Question

I have run into some strange behavior while using BeautifulSoup4. I have the following XML (file name: fake_product.xml):

<product acronym="ACRO1">
<formats>
    <format id="format1">
    </format>
    <format id="format2">
    </format>
    <format id="format3">
    </format>
    <format id="format4">
    </format>
    <format id="format5">
    </format>
    <format id="format6">
    </format>
</formats>
</product>

This TestCase fails:

import unittest
from bs4 import BeautifulSoup


class Test(unittest.TestCase):

    def setUp(self):
        with open('fake_product.xml') as f:
            self.soup = BeautifulSoup(f, 'xml')

    def test_product_removal(self):
        output = len(self.soup.find_all('format'))
        expected = 6
        self.assertEqual(output, expected)

        format_to_delete = self.soup.find(id='format2')
        format_to_delete.extract()
        #self.soup = BeautifulSoup(self.soup.prettify(), 'xml')
        output = len(self.soup.find_all('format'))
        expected -= 1
        self.assertEqual(output, expected)

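The reason is that find_all() no longer finds all of the format elements. If I run print self.soup.prettify(), everything looks fine to me.
If I uncomment the commented line in the TestCase and create a new BeautifulSoup object after extract(), find_all() seems to work correctly again and the TestCase succeeds.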

Can someone explain this behavior to me?

Answer 1

This is a bug introduced in 4.4.0; see the BeautifulSoup 4 project bug tracker:

In some situations, it seems calling extract() does not correctly adjust the next_sibling attribute of the previous element. This leaves the extracted element in the descendant generator. When later calling find(...) or find_all(...), the search then terminates at the extracted element, causing results to be missed.
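To check whether your installed version is affected, something like the following might help. It is only a rough diagnostic sketch: still_reachable is a name I made up, the shortened XML is a stand-in for your file, and I use html.parser here on the assumption that it avoids needing lxml (the bug is described as being in extract(), not in the parser).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for fake_product.xml (assumption: three formats are enough).
XML = """<formats>
    <format id="format1"></format>
    <format id="format2"></format>
    <format id="format3"></format>
</formats>"""


def still_reachable(root, node):
    """Return True if node is still yielded when walking root's descendants."""
    # Compare by identity: we care about this exact object, not a look-alike.
    return any(candidate is node for candidate in root.descendants)


soup = BeautifulSoup(XML, "html.parser")
removed = soup.find(id="format2").extract()

leaked = still_reachable(soup, removed)        # should be False on a healthy release
remaining = len(soup.find_all("format"))       # should be 2 on a healthy release
print(leaked, remaining)
```

If leaked comes back True, or remaining is lower than expected, the tree links were probably left stale by extract() as the bug report describes.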

This bug is also related and contains a potential fix:

Lines 265, 267, 274, 277 need != changing to is not

Line 290 needs == changing to is
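My reading of why that change matters (an assumption on my part, the tracker text above does not spell it out): text nodes in bs4 are str subclasses, so two separate whitespace nodes with the same characters compare equal with == even though they are different objects. The sketch below uses a hypothetical TextNode class and a made-up needs_relink helper to show the difference; it is not the library's actual code.

```python
class TextNode(str):
    """Hypothetical stand-in for a parsed text node (a str subclass)."""


# Two separate whitespace nodes, like the indentation between <format> tags.
before = TextNode("\n    ")
after = TextNode("\n    ")

print(before == after)  # True: same characters
print(before is after)  # False: two distinct objects


def needs_relink(previous_node, next_node, by_identity):
    """Decide whether previous_node should be re-pointed at next_node."""
    if by_identity:
        return previous_node is not next_node
    return previous_node != next_node


# An equality check wrongly concludes the two nodes are "the same" and skips
# the relink; an identity check sees two objects and performs it.
relink_with_equality = needs_relink(before, after, by_identity=False)
relink_with_identity = needs_relink(before, after, by_identity=True)
print(relink_with_equality, relink_with_identity)  # False True
```

That would explain why the problem shows up with indented XML like yours, where the nodes on either side of the removed element are identical-looking whitespace strings.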

I can confirm that it fixes your specific test.

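If you are not comfortable editing the BeautifulSoup source code, the workaround is to rebuild the tree as you did, or to downgrade to 4.3.2 until a fix is released.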
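The rebuild workaround could be wrapped up roughly like this. extract_and_rebuild is a hypothetical helper name, not part of bs4, and the parser argument defaults to 'xml' only because that is what your test uses (it requires lxml; the usage lines below pass html.parser instead).

```python
from bs4 import BeautifulSoup


def extract_and_rebuild(soup, parser="xml", **find_kwargs):
    """Remove the first element matching find_kwargs, then re-parse the tree.

    Hypothetical helper: re-parsing the serialized document gives every node
    fresh next_element / next_sibling links, so find_all() no longer depends
    on whatever links extract() left behind.
    """
    target = soup.find(**find_kwargs)
    if target is not None:
        target.extract()
    # Serialize and parse again so later searches walk a consistent tree.
    return BeautifulSoup(str(soup), parser)


doc = (
    '<formats>\n'
    '    <format id="format1"></format>\n'
    '    <format id="format2"></format>\n'
    '    <format id="format3"></format>\n'
    '</formats>'
)
rebuilt = extract_and_rebuild(BeautifulSoup(doc, "html.parser"),
                              parser="html.parser", id="format2")
print(len(rebuilt.find_all("format")))
```

Note that any references you held to elements of the old tree will not point into the rebuilt one, so look them up again after the call.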

python

beautifulsoup