Python Beautiful Soup: how to get deeply nested elements

Question

I have a page with the following structure:

<div id ="a">
    <table>
        <td> 
            <!-- many tables and divs here -->
        </td>
        <td>
            <table></table>
            <table></table>
            <div class="tabber">
                <table></table>
                <table></table>  <!-- TARGET TABLE -->
            </div>
        </td>
    </table>
</div>

Yes, unfortunately there are no ids or classes on or near the target, apart from "tabber".

I tried to get the div element:

content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

stats_div = soup.findAll('div', class_ = "tabber")[1] # 1 because there are 4 elements on page with that class and number 2 is the target one

But that did not work; I always get no output.

I also tried to traverse the whole tree from the top to reach the target table:

stats_table = soup.find(id='a').findChildren('table')[0].findChildren('td')[1].findChildren('div')[0].findChildren('table')[1]

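But that did not work either. Apparently findChildren('td') does not return the direct children of the first table; it returns all descendants, which is more than 100 td elements.

How can I get only the direct children of an element?

Is there a cleaner way to traverse such an ugly nested tree? Why can't I select the div by its class? That would simplify everything.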

Answer 1

None of the code you have shown seems to reflect anything on that page:

There is no div tag with id='a'. In fact, not a single tag has that attribute. That is why your last command, stats_table = ..., fails.
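To illustrate that failure mode, here is a minimal sketch on a made-up snippet of HTML (the markup and the variable names are assumptions for illustration, not taken from the real page). When nothing matches, find() returns None, so the next call in a long chain raises an AttributeError:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: note that no tag has id="a".
html = '<div id="b"><table><tr><td>cell</td></tr></table></div>'
soup = BeautifulSoup(html, "html.parser")

missing = soup.find(id="a")  # nothing matches, so this is None

try:
    missing.find_all("table")  # calling a method on None
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# A defensive variant: check for None before going deeper.
container = soup.find(id="a")
tables = container.find_all("table") if container is not None else []

print(missing, error_name, tables)
```

Checking each step of a chain separately like this makes it much easier to see which lookup actually came back empty.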

There are exactly 3 div tags whose class attribute equals tabber, not 4:

>>> len(soup.find_all('div', class_="tabber"))
3

And they are not empty either:

>>> len(soup.find_all('div', class_="tabber")[1])
7
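A note on reading that number: len() on a tag counts its direct children, and whitespace-only text nodes count too, so it is not the number of child tags. A small sketch on invented markup (not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between the child tags.
html = '<div class="tabber">\n<table></table>\n<table></table>\n</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="tabber")

total_children = len(div)  # same as len(div.contents); includes the "\n" strings
tag_children = div.find_all(True, recursive=False)  # direct child tags only

print(total_children, len(tag_children))
```

Here the div has two table children, but len(div) is larger because the newlines between them are children as well.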

None of the div tags with class tabber has only 2 table children, but I assume that is because you greatly simplified your own example.

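If you want to scrape a site like this one, where you cannot easily select tags by a unique id, you have no choice but to rely on other attributes, such as the tag name. Sometimes the relative position of a tag in the DOM is also a useful handle.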
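As for getting only direct children when navigating by position: find_all() and find() accept recursive=False, which restricts the search to direct children instead of all descendants. The sketch below uses a reduced structure modelled on the one in the question; the tr rows and the id="target" marker are added here purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely following the structure in the question.
html = """
<div id="a">
  <table>
    <tr>
      <td><table><tr><td>nested</td></tr></table></td>
      <td>
        <table></table>
        <div class="tabber">
          <table></table>
          <table id="target"></table>
        </div>
      </td>
    </tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer_row = soup.find(id="a").table.tr  # first tr of the outer table

all_tds = outer_row.find_all("td")                      # every descendant td
direct_tds = outer_row.find_all("td", recursive=False)  # direct children only

second_cell = direct_tds[1]
tabber = second_cell.find("div", class_="tabber", recursive=False)
target = tabber.find_all("table", recursive=False)[1]  # second table in the div

print(len(all_tds), len(direct_tds), target.get("id"))
```

The default search is recursive, which is why a plain findChildren('td') on the outer table also picks up every td of the nested tables.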

For your specific problem, you can use the title attribute to good effect:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.soccerstats.com/team.asp?league=england&teamid=24'
>>> soup = BeautifulSoup(urllib2.urlopen(url).read(), 'lxml')
>>> all_stats = soup.find('div', id='team-matches-and stats')
>>> left_column, right_column = [x for x in all_stats.table.tr.children if x.name == 'td']
>>> table1, table2 = [x for x in right_column.children if x.name == 'table']  # the two tables at the top right
>>> [x['title'] for x in right_column.find_all('div', class_='tabbertab')]
['Stats', 'Scores', 'Goal times', 'Overall', 'Home', 'Away']
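The name filter in the list comprehensions above is needed because .children yields every direct child, including the whitespace strings between tags. A minimal sketch on invented markup, using an isinstance check against Tag as one alternative way to express the same filter:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical row with newlines between the cells.
html = "<table><tr>\n<td>left</td>\n<td>right</td>\n</tr></table>"
soup = BeautifulSoup(html, "html.parser")
row = soup.table.tr

children = list(row.children)  # text nodes and tags, in document order
cells = [child for child in children if isinstance(child, Tag)]
left, right = cells  # unpacking works only because the strings were filtered out

print(len(children), [cell.get_text() for cell in cells])
```

Without the filter, the two-name unpacking would fail, because the row has more than two children once the newline strings are counted.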

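The last part is the interesting bit: all the tabbed sections at the bottom right carry a title attribute, which makes them much easier to select. In addition, these attributes make the tags unique within the soup, so you can select them directly from the root: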

>>> stats_div = soup.find('div', class_="tabbertab", title="Stats")
>>> len(stats_div.find_all('table', class_="stat"))
3

These 3 items correspond to the "current streaks", "scoring" and "home/away advantage" subsections.
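The same attribute-based lookup can also be written as a CSS selector with select(). The following is a sketch on a small invented fragment (the markup only imitates the class and title names used above, and the table counts are not those of the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the tabbed layout.
html = """
<div class="tabber">
  <div class="tabbertab" title="Stats">
    <table class="stat"></table>
    <table class="stat"></table>
  </div>
  <div class="tabbertab" title="Scores">
    <table class="stat"></table>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword-argument style, as in the session above.
stats_div = soup.find("div", class_="tabbertab", title="Stats")
stat_tables = stats_div.find_all("table", class_="stat")

# CSS selector style: a div with that class and title, then its stat tables.
same_tables = soup.select('div.tabbertab[title="Stats"] table.stat')

print(len(stat_tables), len(same_tables))
```

Both forms pick out the same tags here; which one reads better is largely a matter of taste.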
