How to elegantly get the top-level text of an HTML td with BeautifulSoup4?
Below is a simple HTML fragment parsed with BeautifulSoup4. I want to extract only the top-level raw text, hello.
mysoup = BeautifulSoup('<td>hello<script type="text/javascript">world</script></td>')
I tried several intuitive approaches, but none of them gave the expected result:
mysoup.text # u'helloworld'
mysoup.contents # [<html><body><td>hello<script type="text/javascript">world</script></td></body></html>]
list(mysoup.strings) # [u'hello', u'world']
So how can I achieve this?
First, get a reference to the td node. Then iterate over its children and check which of them are strings:
from bs4 import BeautifulSoup
mysoup = BeautifulSoup('<td>hello<script type="text/javascript">world</script></td>')
td = mysoup.find('td')
# keep only the direct children that are strings; nested tags such as <script> are skipped
print([s for s in td.children if isinstance(s, str)])
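The same idea can be wrapped in a small helper. This is only a sketch: the function name `top_level_text` is made up for illustration, and it assumes Python 3 and that joining the direct string children into one string is what you want. It checks against `NavigableString`, the bs4 class for text nodes, and passes `"html.parser"` explicitly so the result does not depend on which optional parser is installed.

```python
from bs4 import BeautifulSoup, NavigableString


def top_level_text(tag):
    """Join the strings that are direct children of `tag`, ignoring nested tags.

    Hypothetical helper, not part of bs4. Note that comments are also
    NavigableString instances and would be included if present.
    """
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    )


html = '<td>hello<script type="text/javascript">world</script></td>'
# The built-in parser is named explicitly here (an assumption of this sketch).
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")
result = top_level_text(td)
print(result)
```

Because only `td.children` is inspected, text inside the nested script tag is never visited, which is the difference from `.text` and `.strings`: both of those walk every descendant.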