Does Python's beautifulsoup.find_all(text=) have a problem with Unicode characters?

I am using BeautifulSoup to try to locate a P tag in an XML parse tree based on its content:

# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup

# Determine today's date.
today = date.today()

# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"

# Initialize a requests Response object.
page = requests.get(url)

# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")

# Remove tags with irregular text.
for i in soup.find_all("P", text="(See § 125.4 of this subchapter for exemptions.) "):
    print(i)
    i.decompose()

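When I run this code, I get a NoneType object (None is printed to the console), even though I know from looking at the XML file that the element exists (including the trailing nbsp). Does BeautifulSoup have a problem with Unicode, or am I missing something?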

Thanks!

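The main issue is that text="(See § 125.4 of this subchapter for exemptions.) " looks for an exact match against the tag's .string, and none can be found because in your XML the paragraph looks like (<I>See</I> § 125.4 of this subchapter for exemptions.). The nested <I> tag splits the text into several child nodes, so the P tag has no single string to compare against.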
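To see why the exact match fails, here is a minimal offline sketch. The snippet is a hypothetical stand-in for the eCFR markup, and it uses html.parser (which lowercases tag names) instead of features="xml" only to avoid the lxml dependency; `string=` is the current name of the `text=` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure described above; the real
# document is XML, html.parser is used here only so no extra parser is needed.
snippet = (
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)</p>"
    "<p>Plain text only.</p>"
)
soup = BeautifulSoup(snippet, "html.parser")

nested, plain = soup.find_all("p")

# A tag with more than one child has no single string, so .string is None.
nested_string = nested.string
# A tag whose only child is a text node exposes that text as .string.
plain_string = plain.string

# string= is compared against .string, so the nested paragraph cannot match.
exact_matches = soup.find_all(
    "p", string="(See § 125.4 of this subchapter for exemptions.)"
)

# get_text() concatenates the text of all descendants instead.
nested_text = nested.get_text()

print(nested_string, plain_string, len(exact_matches), nested_text)
```

The paragraph's full text is still available through get_text(), which is why approaches that look at the combined text can find it.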

You can fix this behavior by using CSS selectors with the :-soup-contains() pseudo-class:

for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
Example
from datetime import date
import requests
from bs4 import BeautifulSoup

# Determine today's date.
today = date.today()

# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"

# Initialize a requests Response object.
page = requests.get(url)

# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")

# Remove tags with irregular text.
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
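
If you would rather stay with find_all(), one possible alternative (not part of the answer above, just a sketch) is to pass a function that compares the tag's combined text. The snippet, the TARGET constant, and the has_target_text helper are all hypothetical; html.parser and lowercase tag names are again used only to keep the example free of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a trailing non-breaking space (U+00A0),
# as mentioned in the question.
snippet = (
    "<root>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "<p>Keep me.</p>"
    "</root>"
)
soup = BeautifulSoup(snippet, "html.parser")

TARGET = "(See § 125.4 of this subchapter for exemptions.)"


def has_target_text(tag):
    """Return True for p tags whose combined text equals TARGET after trimming."""
    # str.strip() with no arguments also removes the non-breaking space.
    return tag.name == "p" and tag.get_text().strip() == TARGET


removed = []
for tag in soup.find_all(has_target_text):
    # Record the text before decompose() destroys the tag.
    removed.append(tag.get_text())
    tag.decompose()

remaining = [p.get_text() for p in soup.find_all("p")]
print(removed, remaining)
```

Unlike :-soup-contains(), which matches on a substring, this requires the whole paragraph text to equal the target, so it is stricter about what gets removed.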