How to extract html with beautifulsoup4?

Question

The html looks like this:

<td class='Thistd'><a ><img /></a>Here is some text.</td>

I only want the string inside the <td>. I don't need the <a>...</a>. How can I do that?

My code:

from bs4 import BeautifulSoup
html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td)
    print('=============')

What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>

But I only need Here is some text.

Answer 1

Code:

from bs4 import BeautifulSoup
html = """<td class='Thistd'><a ><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td.text)  # the only change you need to make
    print('=============')

Output:

Here is some text.
=============

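Note:

.text returns only the text of the given bs4 object, in this case the td tag, including the text of any tags nested inside it. For more information, see the official site.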
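Because .text gathers the text of every descendant, it only gives the desired result here because the <a> happens to contain no text. The following is a minimal sketch of one way to keep just the strings that sit directly inside the <td>. The markup is a hypothetical variant of the question's html in which the <a> does contain text, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical variant of the question's markup: the <a> now holds text too.
html = "<td class='Thistd'><a>link text</a>Here is some text.</td>"

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# .text concatenates the text of every descendant, including the <a>.
all_text = td.text

# Keep only the strings that are direct children of the <td>.
own_text = "".join(
    child for child in td.children if isinstance(child, NavigableString)
)

print(all_text)
print(own_text)
```

With this input, all_text still includes "link text", while own_text contains only the cell's own string.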

Answer 2

Use td.get_text() (getText() is the older alias for the same method) to get the plain text from your element.

For example:

for td in tds:
    print(td.get_text())
    print('=============')

Output:

Here is some text.
=============

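Edit:

You can remove the <a> element and then print what is left. The .extract method removes that particular tag from the bs4 object it belongs to.

For example: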

for td in tds:
    td.a.extract()  # removes the <a> (and the <img /> inside it) from the td
    print(td)

Output:

<td class="Thistd">Here is some text.</td>
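A cell with no <a> would make td.a evaluate to None, so calling .extract() on it would raise an AttributeError. The sketch below guards against that and then reads the remaining text, assuming the question's markup. The results list and the removed variable are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = "<td class='Thistd'><a><img /></a>Here is some text.</td>"
soup = BeautifulSoup(html, "html.parser")

results = []
for td in soup.find_all("td", class_="Thistd"):
    # td.a is None when the cell has no <a>, so check before calling extract().
    if td.a is not None:
        removed = td.a.extract()  # extract() returns the tag it removed
    # With the <a> gone, the remaining text is the cell's own string.
    results.append(td.get_text(strip=True))

print(results)
```

Note that extract() modifies the parsed tree in place, so the <a> is no longer available on soup afterwards. If the link is needed later, keep the returned tag.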

python

beautifulsoup