BeautifulSoup - Extract text from the next div's sub-tag based on the previous div's sub-tag
I am trying to extract text from the next div based on the span in the preceding div. Below is the HTML content:
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven
<br></span></div>
I tried to find the text using
n_field = soup.find('span', text="Name\n")
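and then tried to get the text from the next sibling using
n_field.next_sibling()
However, because of the "\n" in the field, I am unable to find the span and extract the next_sibling text.
In short, I am trying to build a dictionary in the following format: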
{"Name": "Ven"}
Any help or ideas would be appreciated.
You can use re instead of bs4.
import re
html = """
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;">
<span style="font-family: b'Times-Bold'; font-size:13px">Name
<br>
</span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;">
<span style="font-family: b'Helvetica'; font-size:13px">Ven
<br>
</span>
"""
mo = re.search(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL)
print(mo.groups())
# for consecutive cases use re.finditer or re.findall
html *= 5
mo = re.finditer(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL)
for match in mo:
    print(match.groups())
for (key, value) in re.findall(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL):
    print(key, value)
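To get from the matches to the dictionary the question asks for, the pairs can be passed to dict(). The sketch below is one possible variation, not the answer's exact pattern: it drops the hard-coded "Name" and "13px" and instead collects the text of every span up to its <br>. It assumes (this is not stated in the question) that spans strictly alternate label, value, label, value; the style attributes are shortened for readability and span_texts is just an illustrative name.

```python
import re

# Sample markup mirroring the question: a label div followed by a value div.
# The style attributes are shortened here; they do not affect the match.
html = """
<div style="left:37px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name
<br></span></div><div style="left:85px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven
<br></span></div>
"""

# Capture the text of every span up to its <br>, ignoring surrounding whitespace.
span_texts = re.findall(r'<span[^>]*>\s*(.*?)\s*<br>', html, re.DOTALL)

# Assumption: spans alternate label, value, label, value, ...
result = dict(zip(span_texts[0::2], span_texts[1::2]))
print(result)
```

If the real document mixes in spans that are neither labels nor values, the pairing would drift, so the alternating assumption should be checked against the actual data first.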
I tried this and, for some reason, could not get nextSibling() to work even after removing the \n, so I tried a different strategy, shown below:
from bs4 import BeautifulSoup
# Let's get rid of the \n
html = """<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven<br></span></div>""".replace("\n","")
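soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
span_list = soup.find_all("span")
# The first span holds the key, the second span holds the value
result = {span_list[0].text: span_list[1].text.replace(" ", "")}
print(result)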
The result is:
{'Name': 'Ven'}
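A likely reason the sibling lookup fails: the span holds two children (the text "Name\n" and a <br>), so a text= filter, which compares against the tag's .string, finds nothing; and the span is the only child of its div, so it has no sibling of its own. The value sits in the next div, which is a sibling of the span's parent. A minimal sketch of that navigation, assuming the label and value divs are adjacent as in the question (style attributes shortened, variable names illustrative):

```python
from bs4 import BeautifulSoup

# Markup shaped like the question's, including the "\n" before each <br>.
html = """<div style="left:37px;"><span style="font-size:13px">Name
<br></span></div><div style="left:85px;"><span style="font-size:13px">Ven
<br></span></div>"""

soup = BeautifulSoup(html, "html.parser")

# Match on the stripped text, so the trailing "\n" and the <br> do not matter.
label_span = soup.find(
    lambda tag: tag.name == "span" and tag.get_text(strip=True) == "Name"
)

# Step up to the enclosing div, then across to the div that follows it.
value_div = label_span.find_parent("div").find_next_sibling("div")

result = {label_span.get_text(strip=True): value_div.get_text(strip=True)}
print(result)
```

Unlike indexing into the list of all spans, this keeps working when other spans appear earlier in the page, as long as each value div directly follows its label div.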