Does Python BeautifulSoup find_all(text=) have a problem with Unicode characters?
I'm using BeautifulSoup to try to locate a P tag in an XML parse tree based on its contents:
# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup
# Determine today's date.
today = date.today()
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove tags with irregular text.
for i in soup.find_all("P", text="(See § 125.4 of this subchapter for exemptions.) "):
    print(i)
    i.decompose()
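When I run this code, I get a NoneType object (None is printed to the console), even though I can see from the XML file that the element exists (including the trailing nbsp). Does Beautiful Soup have a problem with Unicode, or am I missing something?
Thanks!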
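The main issue is that text="(See § 125.4 of this subchapter for exemptions.) " looks for an exact match and cannot find one, because in your XML the paragraph looks like (<I>See</I> § 125.4 of this subchapter for exemptions.). The nested <I> tag splits the paragraph into several child nodes, so the paragraph's .string is None and text= has nothing to compare against. This is not a Unicode problem.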
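To see why the exact match fails, here is a minimal sketch that runs without a network request. The inline fragment is hypothetical and only modeled on the markup described above. It assumes the built-in html.parser (to avoid needing lxml), which lowercases tag names, so the tags appear as p and i here rather than P and I.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the eCFR markup (not fetched from the API).
fragment = (
    "<div>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "<p>Plain paragraph.</p>"
    "</div>"
)
soup = BeautifulSoup(fragment, "html.parser")

mixed, plain = soup.find_all("p")

# A tag with several children has no single string, so .string is None.
print(mixed.string)
# A tag with exactly one text child exposes that text through .string.
print(plain.string)
# get_text() joins the text of all descendants, including the nested <i>.
print(repr(mixed.get_text()))

# The string filter compares against .string, so the mixed paragraph never matches.
matches = soup.find_all(
    "p", string="(See § 125.4 of this subchapter for exemptions.)\u00a0"
)
print(matches)
```

The empty result comes from the nested tag, not from the § sign or the non-breaking space.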
You can use CSS selectors with :-soup-contains() to fix that behavior:
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
Example
from datetime import date
import requests
from bs4 import BeautifulSoup
# Determine today's date.
today = date.today()
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove tags with irregular text.
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
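If you want to check the selector without hitting the eCFR API, the following sketch applies the same idea to a hypothetical inline fragment. It assumes the built-in html.parser instead of the xml feature, so tag names are lowercase (p rather than P), and the second paragraph is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one paragraph with a nested tag, one ordinary paragraph.
fragment = (
    "<div>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "<p>Another paragraph.</p>"
    "</div>"
)
soup = BeautifulSoup(fragment, "html.parser")

# :-soup-contains() tests the combined text of the element and its
# descendants, so the nested <i> no longer gets in the way.
selector = 'p:-soup-contains("See § 125.4 of this subchapter for exemptions.")'
for tag in soup.select(selector):
    tag.decompose()

# Only the paragraph that did not contain the phrase is left.
remaining = [p.get_text() for p in soup.find_all("p")]
print(remaining)
```

Because the selector does a substring test, the trailing non-breaking space and the opening parenthesis do not need to be part of the search string.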