Beautiful Soup: Separating out a span element from a p element
I need to pull a span element out of its enclosing p element.
Below is a concrete example of one of the p elements I am parsing:
<p id="p-9">
<span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
inbred mouse strains.
</span>
We experimentally inoculated 21 mouse strains with the highly
pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213)
and monitored the animals for 30 days thereafter for signs of
morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>)
values varied from 40 50% egg infective doses (EID<sub>50</sub>)
for the influenza virus-susceptible strain DBA/2<sub>S</sub>
(susceptibility indicated by “S”) to more than 10<sup>6</sup>
EID<sub>50</sub> for the influenza virus-resistant strains
BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub>
(resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1-
1">Fig. 1</a>).
</p>
If I hold the paragraph in the variable paragraph as a bs4.element.Tag and run
print(paragraph.text)
the result is
H5N1 virus pathogenic phenotypes among inbred mouse strains.We experimentally
inoculated 21 mouse strains with the highly pathogenic H5N1 influenza A virus
A/Hong Kong/213/03 (HK213) and monitored the animals for 30 days thereafter
for signs of morbidity and mortality. The 50% mouse lethal dose (MLD50)
values varied from 40 50% egg infective doses (EID50) for the influenza
virus-susceptible strain DBA/2S (susceptibility indicated by “S”) to more
than 106 EID50 for the influenza virus-resistant strains BALB/cR and
BALB/cByR (resistance indicated by “R”) (Fig. 1).
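As you can see between the first and second sentences, no space is created between the text inside the span and the text in the rest of the paragraph.
It ends up looking like this:
“H5N1 virus pathogenic phenotypes among inbred mouse strains.We experimentally...”
The two separate sentences are left with no space after the period. This matters because I will split the text into sentences later, and most sentence splitters break on a period followed by a space. Most of my other sentences are formed correctly.
Is there a way with bs4 to isolate the text in the span from the rest of the text and then join the two back together with proper spacing?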
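I assume you are using get_text() (paragraph.text is equivalent). bs4 offers an alternative called strings, a generator that yields every string in the soup. You can then join them to get correctly spaced text: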
from bs4 import BeautifulSoup
html_doc = """
<p>
<span>Some Text.</span>
Some text and probably other stuff.
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(" ".join(soup.strings))
print(" ".join(soup.stripped_strings))
Also, I see that your example contains a lot of whitespace used for formatting. You can get rid of it by using stripped_strings instead of strings.
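To make the difference concrete, here is a minimal sketch that reuses the small document from this answer (written on one line with explicit \n characters) and compares the two generators. The variable names raw_strings, clean_strings and joined are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = "<p>\n<span>Some Text.</span>\nSome text and probably other stuff.\n</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# .strings yields every text node unchanged, including whitespace-only nodes
raw_strings = list(soup.strings)

# .stripped_strings strips each node and skips nodes that end up empty
clean_strings = list(soup.stripped_strings)

# Joining the stripped nodes with one space restores the sentence boundary
joined = " ".join(clean_strings)
print(raw_strings)
print(clean_strings)
print(joined)  # Some Text. Some text and probably other stuff.
```

Note that joining with a space puts a space between every pair of text nodes, so text from nested inline tags such as sub or sup is also separated from its neighbours.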
Try:
import re
from bs4 import BeautifulSoup
html = '''
<p id="p-9">
<span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
inbred mouse strains.
</span>
We experimentally inoculated 21 mouse strains with the highly
pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213)
and monitored the animals for 30 days thereafter for signs of
morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>)
values varied from 40 50% egg infective doses (EID<sub>50</sub>)
for the influenza virus-susceptible strain DBA/2<sub>S</sub>
(susceptibility indicated by “S”) to more than 10<sup>6</sup>
EID<sub>50</sub> for the influenza virus-resistant strains
BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub>
(resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1-
1">Fig. 1</a>).
</p>
'''
soup = BeautifulSoup(html, 'lxml')
p = soup.select('p')
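for text in p:
    # Join the text nodes with a space, then turn newlines into spaces
    para = text.get_text(' ').replace('\n', ' ')
    # Collapse runs of spaces into a single space
    para = re.sub(' +', ' ', para)
    print(para.strip())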
Prints:
H5N1 virus pathogenic phenotypes among inbred mouse strains. We experimentally inoculated 21 mouse...
and so on.
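One side effect worth checking before relying on a separator: get_text(' ') places the separator between all text nodes, not only after the span. The sketch below uses a shortened, made-up paragraph (not the original data) to show what happens around a sub element.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the paragraph in the question
snippet = "<p><span>Heading.</span>The dose (MLD<sub>50</sub>) varied.</p>"
p = BeautifulSoup(snippet, "html.parser").p

# With a separator, every text node boundary gets a space
with_sep = p.get_text(" ")
# Without a separator, the text nodes are simply concatenated
without_sep = p.get_text()

print(with_sep)     # Heading. The dose (MLD 50 ) varied.
print(without_sep)  # Heading.The dose (MLD50) varied.
```

If tokens such as MLD50 need to stay intact for later processing, a separator applied to the whole paragraph may not be the right tool.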
Another solution:
import re
from bs4 import BeautifulSoup
txt = '''<p id="p-9">
<span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
inbred mouse strains.
</span>
We experimentally inoculated 21 mouse strains with the highly
pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213)
and monitored the animals for 30 days thereafter for signs of
morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>)
values varied from 40 50% egg infective doses (EID<sub>50</sub>)
for the influenza virus-susceptible strain DBA/2<sub>S</sub>
(susceptibility indicated by “S”) to more than 10<sup>6</sup>
EID<sub>50</sub> for the influenza virus-resistant strains
BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub>
(resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1-
1">Fig. 1</a>).
</p>'''
soup = BeautifulSoup(txt, 'html.parser')
paragraph = soup.select_one('p')
# add a space at the end of each span:
for span in paragraph.select('span'):
    span.append(BeautifulSoup(' ', 'html.parser'))
# post-process the text: collapse every run of whitespace into one space
print(re.sub(r'\s+', ' ', paragraph.text).strip())
Prints:
H5N1 virus pathogenic phenotypes among inbred mouse strains. We experimentally inoculated 21 mouse strains with the highly pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) and monitored the animals for 30 days thereafter for signs of morbidity and mortality. The 50% mouse lethal dose (MLD50) values varied from 40 50% egg infective doses (EID50) for the influenza virus-susceptible strain DBA/2S (susceptibility indicated by “S”) to more than 106 EID50 for the influenza virus-resistant strains BALB/cR and BALB/cByR (resistance indicated by “R”) (Fig. 1).
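A further variation, not taken from the answers above, is to do literally what the question asks: detach the span, read its text and the remaining paragraph text separately, and join the two with a single space. This is only a sketch; split_heading_and_body is a hypothetical helper name, and it assumes the heading span carries the class inline-l2-heading as in the question.

```python
import re
from bs4 import BeautifulSoup


def normalize(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_heading_and_body(html):
    """Return (heading, body) text for the first <p> in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    span = paragraph.find("span", class_="inline-l2-heading")
    heading = ""
    if span is not None:
        span.extract()              # detach the span from the paragraph
        heading = span.get_text()   # the detached span still holds its text
    body = paragraph.get_text()     # what is left no longer includes the span
    return normalize(heading), normalize(body)


sample = (
    '<p id="p-9"><span class="inline-l2-heading">Heading\nsentence.</span>'
    "Body with MLD<sub>50</sub> values.</p>"
)
heading, body = split_heading_and_body(sample)
joined = " ".join(part for part in (heading, body) if part)
print(joined)  # Heading sentence. Body with MLD50 values.
```

Because no separator is passed to get_text(), text from nested sub and sup tags stays attached to its neighbours, and the only space added is the one between the heading and the body.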