How to extract text from between the <br> tags in BeautifulSoup
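What I want to do is scrape only the company names from a <td> element that contains multiple <br> tags. FYI, some <td> elements have one company name while others have two. See the <td> element below: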
<td id="MainContent_DisassociatedRegistrationsCell" colspan="2">
<p style="background-color:#CCCCCC;width:100%;text-align:center">
<strong>License #:
<a href="LicenseDetail.aspx?LicNum=332673">332673</a>
</strong>
</p>
BAY AREA REMODELING CO
<br>
5230 EAST 12TH
<br>
OAKLAND, CA 94601
<br>
<strong>Effective Dates:</strong>
09/16/1982 - 06/30/1984
<p style="background-color:#CCCCCC;width:100%;text-align:center">
<strong>License #:
<a href="LicenseDetail.aspx?LicNum=377133">377133</a>
</strong>
</p>
SAVAGE ROOFING COMPANY
<br>
3055 ALVARADO STREET
<br>
SAN LEANDRO, CA 94577
<br>
<strong>Effective Dates:</strong>
07/01/1982 - 03/31/1985
</td>
So from the <td> element above, I want the output:
BAY AREA REMODELING CO
SAVAGE ROOFING COMPANY
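Find the p tags you need, then read next_sibling on each one: the company name is the text node that immediately follows the closing </p>. For example: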
from bs4 import BeautifulSoup
html = """<td id="MainContent_DisassociatedRegistrationsCell" colspan="2">
<p style="background-color:#CCCCCC;width:100%;text-align:center">
<strong>License #:
<a href="LicenseDetail.aspx?LicNum=332673">332673</a>
</strong>
</p>
BAY AREA REMODELING CO
<br>
5230 EAST 12TH
<br>
OAKLAND, CA 94601
<br>
<strong>Effective Dates:</strong>
09/16/1982 - 06/30/1984
<p style="background-color:#CCCCCC;width:100%;text-align:center">
<strong>License #:
<a href="LicenseDetail.aspx?LicNum=377133">377133</a>
</strong>
</p>
SAVAGE ROOFING COMPANY
<br>
3055 ALVARADO STREET
<br>
SAN LEANDRO, CA 94577
<br>
<strong>Effective Dates:</strong>
07/01/1982 - 03/31/1985
</td>"""
soup = BeautifulSoup(html, 'html.parser')
# The text node right after each <p> holds the company name
for p in soup.find_all('p'):
    print(p.next_sibling.strip())
Output:
BAY AREA REMODELING CO
SAVAGE ROOFING COMPANY
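The loop above assumes every <p> is directly followed by a text node. On other pages, next_sibling might be None or another tag, in which case .strip() would fail. A slightly more defensive sketch follows; the helper name `company_names` and the shortened `sample` markup are hypothetical and only for illustration, while the cell id is taken from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def company_names(html, cell_id="MainContent_DisassociatedRegistrationsCell"):
    """Return the text that directly follows each <p> inside the given cell."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", id=cell_id)
    if cell is None:
        return []
    names = []
    for p in cell.find_all("p"):
        sibling = p.next_sibling
        # next_sibling can be None or a Tag, so only accept text nodes
        if isinstance(sibling, NavigableString):
            text = sibling.strip()
            if text:
                names.append(text)
    return names


# Shortened version of the markup from the question (assumed for this sketch)
sample = """<td id="MainContent_DisassociatedRegistrationsCell" colspan="2">
<p><strong>License #: <a href="LicenseDetail.aspx?LicNum=332673">332673</a></strong></p>
BAY AREA REMODELING CO
<br>
5230 EAST 12TH
<br>
OAKLAND, CA 94601
<p><strong>License #: <a href="LicenseDetail.aspx?LicNum=377133">377133</a></strong></p>
SAVAGE ROOFING COMPANY
<br>
3055 ALVARADO STREET
</td>"""

print(company_names(sample))
```

Scoping the search to the cell by id keeps unrelated <p> tags elsewhere on the page from contributing entries.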
Using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> [p.next_sibling.strip() for p in soup.find_all("p")]
['BAY AREA REMODELING CO', 'SAVAGE ROOFING COMPANY']
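If the address lines between the <br> tags are wanted as well, the same idea extends to next_siblings, which iterates over everything after the <p>. This is only a sketch under the assumption that each block ends at the first tag other than <br> (the <strong> holding "Effective Dates:" in the question's markup); `lines_after` is a hypothetical helper, and the sample is trimmed from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def lines_after(p):
    """Collect text nodes after a <p>, stopping at the first tag other than <br>."""
    lines = []
    for node in p.next_siblings:
        if isinstance(node, Tag):
            if node.name != "br":
                break  # e.g. <strong>Effective Dates:</strong> ends the block
            continue   # skip the <br> separators themselves
        text = node.strip()
        if text:
            lines.append(text)
    return lines


# Trimmed version of one license block from the question (assumed for this sketch)
sample = """<td>
<p><strong>License #: 332673</strong></p>
BAY AREA REMODELING CO
<br>
5230 EAST 12TH
<br>
OAKLAND, CA 94601
<br>
<strong>Effective Dates:</strong>
09/16/1982 - 06/30/1984
</td>"""

soup = BeautifulSoup(sample, "html.parser")
for p in soup.find_all("p"):
    lines = lines_after(p)
    if lines:
        # First line is the company name, the rest is the address
        print(lines[0], lines[1:])
```

The first element of each result is the company name, so the answers above are the special case of taking only lines[0].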