Does Python's beautifulsoup.find_all(text=) have a problem with Unicode characters?

Question

I'm using beautifulsoup to try to locate a P tag in an XML parse tree based on its content:

# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup

# Determine today's date.
today = date.today()

# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"

# Initialize a requests Response object.
page = requests.get(url)

# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")

# Remove tags with irregular text.
for i in soup.find_all("P", text="(See § 125.4 of this subchapter for exemptions.) "):
    print(i)
    i.decompose()

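When I run this code I get a NoneType object (None is printed to the console), even though I know from looking at the XML that the element exists in the file (including the trailing nbsp). Does Beautiful Soup have a problem with Unicode, or am I missing something?

Thanks!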

Answer 1

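The main problem is that text="(See § 125.4 of this subchapter for exemptions.) " looks for an exact match against the tag's string, and that match fails because in your XML the paragraph actually looks like (<I>See</I> § 125.4 of this subchapter for exemptions.) . The nested <I> tag means the P tag has several children, so its .string is None and the filter has nothing to compare against. It is not a Unicode problem.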
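To see the mismatch in isolation, here is a minimal sketch on a made-up fragment that mimics the paragraph in question. The fragment is an assumption, not the real eCFR response, and it uses the built-in html.parser with lowercase tag names so that lxml isn't needed. It also uses string=, which is the newer name for the text= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the paragraph from the question
# (not the real eCFR response). \u00a0 is the trailing nbsp.
fragment = "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
target = "(See § 125.4 of this subchapter for exemptions.)\u00a0"

soup = BeautifulSoup(fragment, "html.parser")
p = soup.find("p")

# The tag has three children: "(", <i>See</i>, and the rest of the text.
child_count = len(p.contents)

# With more than one child, .string is None, so an exact string match
# has nothing to compare against.
tag_string = p.string
exact_matches = soup.find_all("p", string=target)

# The combined text is still what you see when reading the XML.
full_text = p.get_text()

print(child_count, tag_string, exact_matches)
print(repr(full_text))
```

So the combined text is identical to the search string, Unicode characters included, but the exact-match filter never gets to look at it.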

You can fix this with CSS selectors and the :-soup-contains() pseudo-class:

for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()

Example

from datetime import date
import requests
from bs4 import BeautifulSoup

# Determine today's date.
today = date.today()

# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"

# Initialize a requests Response object.
page = requests.get(url)

# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")

# Remove tags with irregular text.
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
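If you'd rather stay with find_all(), one alternative (not part of the answer above, just a sketch) is to pass a function that checks the tag's combined text via get_text(). The fragment, the needle variable and the contains_needle helper are all made up for illustration, and html.parser with lowercase tags is used again so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one paragraph to keep, one to remove.
fragment = (
    "<div>"
    "<p>Keep this paragraph.</p>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "</div>"
)
needle = "See § 125.4 of this subchapter for exemptions."

soup = BeautifulSoup(fragment, "html.parser")


def contains_needle(tag):
    """Match <p> tags whose combined text (nested tags included) contains needle."""
    return tag.name == "p" and needle in tag.get_text()


# find_all() returns a list-like result, so decomposing while looping is safe.
removed = 0
for p in soup.find_all(contains_needle):
    p.decompose()
    removed += 1

remaining = [p.get_text() for p in soup.find_all("p")]
print(removed, remaining)
```

Like :-soup-contains(), this is a substring test on the whole text of the tag, so it doesn't care about the nested tag or the trailing nbsp.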
