How to extract HTML with beautifulsoup4?
The HTML looks like this:
<td class='Thistd'><a ><img /></a>Here is some text.</td>
I only want the string inside the <td>. I don't need the <a>...</a>.
How can I do that?
My code:
from bs4 import BeautifulSoup

html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td)  # prints the whole tag, markup included
    print('=============')
What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>
but I only need Here is some text.
Code:
from bs4 import BeautifulSoup

html = """<td class='Thistd'><a ><img /></a>Here is some text.</td>"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td.text)  # the only change you need to make
    print('=============')
Output:
Here is some text.
=============
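Note:
.text returns the text content of the given bs4 object, in this case the td tag. It joins every string inside the tag, including strings in nested tags. Here the <a> holds only an <img> and no text, so the result is just Here is some text. For more information, see the official site.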
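To illustrate the note above, here is a minimal sketch of what happens when the nested tag does contain text. The HTML is a hypothetical variant of the question's snippet (the <b>link</b> part is an assumption, not in the original), and it assumes a bs4 version that accepts the string= argument (4.4 or later). In that case .text includes the nested text, while asking only for the direct string children of the td leaves it out.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's HTML: the <a> now contains text.
html = "<td class='Thistd'><a><b>link</b></a>Here is some text.</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# .text joins every string under the tag, nested ones included.
all_text = td.text

# Only the strings that are direct children of <td>; the tree is not modified.
direct_text = "".join(td.find_all(string=True, recursive=False))

print(all_text)     # linkHere is some text.
print(direct_text)  # Here is some text.
```

This variant leaves the parsed tree untouched, which may matter if the same soup is reused afterwards.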
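Use td.getText() to get the plain text from your element (get_text() is the current spelling of the same method).
For example:
for td in tds:
    print(td.getText())
    print('=============')
Output:
Here is some text.
=============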
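Real pages usually have newlines and indentation inside the cell, so the plain call can return extra whitespace. The following is a small sketch, assuming a hypothetical indented version of the question's HTML, that uses the strip and separator arguments of get_text() to tidy the result.

```python
from bs4 import BeautifulSoup

# Hypothetical, indented version of the question's HTML.
html = "<td class='Thistd'>\n  <a><img /></a>\n  Here is some text.\n</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

raw = td.get_text()              # keeps the surrounding whitespace
clean = td.get_text(strip=True)  # strips each string and drops empty ones

# With several text fragments, a separator keeps the words apart.
# This second snippet is also an illustrative assumption.
td2 = BeautifulSoup("<td>Here is some<b>bold</b>text.</td>", "html.parser").td
spaced = td2.get_text(" ", strip=True)

print(repr(raw))
print(clean)   # Here is some text.
print(spaced)  # Here is some bold text.
```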
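Edit:
You can remove the <a> element and then print what is left. The .extract() method removes that tag from the bs4 object.
For example:
for td in tds:
    td.a.extract()  # removes the first <a> inside this td
    print(td)
Output:
<td class="Thistd">Here is some<b>here is a b tag </b></td>
(This output appears to come from a variant of the input that also contained a <b> tag. With the HTML from the question, the result is <td class="Thistd">Here is some text.</td>)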
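One thing to watch with td.a.extract(): td.a is None when a cell has no <a>, so the call raises AttributeError. A slightly more defensive sketch follows, assuming a hypothetical two-cell table where only the first cell has a link. It loops over whatever <a> tags exist, removes them, and then reads the remaining text.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second cell has no <a> at all.
html = """<table><tr>
<td class='Thistd'><a><img /></a>Here is some text.</td>
<td class='Thistd'>No link in this cell.</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

texts = []
cells = soup.find_all("td", class_="Thistd")
for td in cells:
    # find_all returns a list, so removing tags inside the loop is safe,
    # and an empty list simply means there is nothing to remove.
    for a in td.find_all("a"):
        a.extract()
    texts.append(td.get_text(strip=True))

print(texts)          # ['Here is some text.', 'No link in this cell.']
print(str(cells[0]))  # <td class="Thistd">Here is some text.</td>
```

Note that extract() changes the parsed tree in place and returns the removed tag. If the removed tag is never needed again, decompose() is an alternative that removes and destroys it.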