Using bs4 to find an HTML tag (h2) that contains text
For this part of the HTML code:
html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""
I am using BeautifulSoup to find the h2 whose text equals "Content Logical Definition", along with its next siblings. But BeautifulSoup cannot find the h2. Here is my code:
soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings
This is the error:
AttributeError: 'NoneType' object has no attribute 'nextsibilings'
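There are several h2 elements in the document, but the only thing that makes this h2 unique is the text "Content Logical Definition". After finding this h2, I want to extract the data from the table listed below it.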
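The main problem is the way you locate the h2 element to find siblings from. Matching on text compares against the tag's .string, which is None here because this h2 contains a span, a text node, and a link rather than a single string. I would use a function instead and check whether "Content Logical Definition" is in the text: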
soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
Also, to get the next siblings, you should use .next_siblings rather than nextsibilings.
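To see why the text-based lookup returns None, here is a minimal sketch on a shortened version of the h2 from the question. It assumes the built-in "html.parser" so that it runs without lxml, and it uses the string argument, which is the newer name for text in Beautiful Soup 4.4 and later; the shortened markup is only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the h2 from the question
html = (
    '<h2><span class="sectioncount">3.342.2323</span> '
    'Content Logical Definition <a class="self-link" href="#">link</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# The h2 has three children (span, text node, a), so .string is None
print(h2.string)

# A string filter on a tag is compared against .string, so nothing matches
by_string = soup.find("h2", string="Content Logical Definition")
print(by_string)

# .text joins all descendant strings, so a substring test succeeds
by_function = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text
)
print(by_function is h2)
```

The function-based search works because it inspects the combined text of the element instead of requiring a single string child.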
Demo:
>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
...
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>
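Though, now that I know the real HTML you are dealing with and how messy it is, I think you should iterate over the siblings and break at the next h2, or earlier if you find a table before that. Actual implementation: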
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # locate the h2 whose combined text contains the heading
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

    table = None
    for sibling in h2.find_next_siblings():
        # stop at the first table that follows the heading
        if sibling.name == "table":
            table = sibling
            break
        # the next h2 starts another section, so give up
        if sibling.name == "h2":
            break

    print(table)
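Once a table has been located, its rows still need to be pulled out as data, which is what the question ultimately asks for. The following is only a sketch under stated assumptions: table_rows is a hypothetical helper, the markup is a trimmed copy of the snippet from the question, and "html.parser" is used so it runs without lxml. Because the table in that snippet is nested inside a div rather than being a direct sibling of the h2, the sketch uses find_next("table") instead of the sibling loop above; note that find_next does not stop at the next h2.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the snippet from the question (illustrative only)
html = """<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition</h2>
<hr/>
<div><ul><li>Include these codes<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>"""


def table_rows(table):
    """Hypothetical helper: return the table as a list of rows of cell text."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # drop trailing empty cells such as the bare <td/> in the data rows
        while cells and not cells[-1]:
            cells.pop()
        rows.append(cells)
    return rows


soup = BeautifulSoup(html, "html.parser")
h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

# find_next walks forward in document order, so it also reaches nested tables
table = h2.find_next("table")
rows = table_rows(table)

for row in rows:
    print(row)
```

The first row holds the column headers (Code, Display) and the remaining rows hold the values, so the result can be passed on to csv.writer or similar as needed.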