How to extract html paragraph elements only if they contain bold elements
I am trying to extract the paragraph elements under ID = 'See' on a Wikitravel page, all into a list.
Using:
import bs4
import requests

response = requests.get("https://wikitravel.org/en/Bhopal")
if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    plot = []
    # find the node with id of "See"
    mark = html.find(id="See")
    # walk through the siblings of the parent (H2) node
    # until we reach the next H2 node
    for elt in mark.parent.nextSiblingGenerator():
        if elt.name == "h2":
            break
        if hasattr(elt, "text"):
            plot.append(elt.text)
Now I want to extract only the paragraphs that contain bold elements. How can I do that?
Is this what you are looking for? I added a couple of lines to your code. I used the lxml parser (html.parser works fine as well).
from bs4 import BeautifulSoup as bs
import ssl
import requests

ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://wikitravel.org/en/Bhopal'
content = requests.get(url).text
soup = bs(content, 'lxml')

plot = []
mark = soup.find(id="See")
# walk through the siblings of the parent (H2) node
# until we reach the next H2 node
for elt in mark.parent.next_siblings:
    if elt.name == "h2":
        break
    # elt.name is None for plain strings between tags, so this
    # keeps only real tags that contain a <b> descendant
    if elt.name and elt.find('b'):
        plot.append(elt.text)

print(*plot, sep='\n')  # just to print the list in a readable way
The first few lines of the output in my Jupyter notebook:
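To make the filter easy to verify without a network call, here is a minimal, self-contained sketch of the same idea: walk the siblings after the heading that contains `id="See"`, stop at the next `h2`, and keep only tags that contain a `<b>` descendant. The HTML snippet below is made up for illustration and only mimics the structure of the Wikitravel page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Wikitravel "See" section.
html = """
<h2><span id="See">See</span></h2>
<p><b>Taj-ul-Masajid</b>. One of the largest mosques in Asia.</p>
<p>Plain paragraph without any bold text.</p>
<p><b>Upper Lake</b>. A large artificial lake.</p>
<h2><span id="Do">Do</span></h2>
<p><b>Boating</b>. Not collected: it comes after the next h2.</p>
"""

soup = BeautifulSoup(html, "html.parser")
mark = soup.find(id="See")

plot = []
# walk through the siblings of the parent (h2) node
# until we reach the next h2 node
for elt in mark.parent.next_siblings:
    if elt.name == "h2":
        break
    # elt.name is None for the whitespace strings between tags,
    # so this keeps only <p> tags that contain a <b> descendant
    if elt.name == "p" and elt.find("b"):
        plot.append(elt.get_text())

print(*plot, sep="\n")
```

Newer BeautifulSoup versions (with soupsieve installed) can also express the "paragraph containing bold" condition as a CSS selector, e.g. `soup.select("p:has(b)")`, though that matches across the whole document rather than stopping at the next `h2`.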