Python Web Scraping td class span

Question

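I'm new to Python and web scraping. I have been trying to scrape the highlighted section of the page source in order to retrieve the numbers 1.16, 7.50 and 14.67, but I'm having no luck using td, class, table-matches__odds with pageSoup.find_all... Does anyone know what I'm missing here?

I am using BeautifulSoup 4.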

Answer 1

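Awkward.

First I found the columns for the 'ratio' items (odds?), to serve as reference points within the rows we want to plunder, and put them in a list called ratio.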

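Then I looked at the next siblings of a typical element of ratio, namely the first one.

You are interested only in the first row of the table, so I selected ratio[0] and asked for its next siblings, all of which are td elements.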

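Then I extracted what you wanted from each of them according to its internal structure. The only complicated one is the first: its value is not on the td itself but on a span nested three levels deep. I used the descendants iterator to get its descendants, asked for the innermost one, and then read that element's data-odd attribute.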
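
To see why the innermost descendant is the one that matters, here is a minimal sketch that parses only the first odds cell, with its markup copied from the session output in this answer. It assumes the built-in 'html.parser' instead of 'lxml' so that nothing beyond bs4 is required; the name CELL_HTML is made up for this illustration.

```python
from bs4 import BeautifulSoup

# Markup of the first odds cell, copied from the session output (no network needed).
# CELL_HTML is a hypothetical name used only for this sketch.
CELL_HTML = (
    '<table><tr><td class="table-matches__odds colored">'
    '<span><span><span data-odd="1.16"></span></span></span>'
    '</td></tr></table>'
)

# Assumption: the stdlib-backed 'html.parser' is used here rather than 'lxml'.
soup = BeautifulSoup(CELL_HTML, "html.parser")
cell = soup.find("td", class_="table-matches__odds")

# .descendants walks the subtree in document order:
# outer span, middle span, inner span.
nested = list(cell.descendants)
innermost = nested[-1]

print(len(nested))            # 3
print(innermost["data-odd"])  # 1.16

# A less position-dependent alternative: ask for any descendant
# that carries a data-odd attribute at all.
by_attribute = cell.find(attrs={"data-odd": True})
print(by_attribute["data-odd"])  # 1.16
```

Indexing with [2] works only while the cell contains exactly three nested tags and no whitespace between them, so the attribute-based lookup may be the safer choice if the page layout changes.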

>>> import bs4
>>> import requests
>>> page = requests.get('http://www.betexplorer.com/soccer/scotland/premiership-2016-2017/results/').text
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> ratio = soup.find_all('td', attrs={'class': 'h-text-center'})
>>> ratio[0].find_next_siblings()
[<td class="table-matches__odds colored"><span><span><span data-odd="1.16"></span></span></span></td>, <td class="table-matches__odds" data-odd="7.50"></td>, <td class="table-matches__odds" data-odd="14.67"></td>, <td class="h-text-right h-text-no-wrap">21.05.2017</td>]
>>> len(ratio)
15
>>> zeroth_ratio_sibs = ratio[0].find_next_siblings()
>>> first_item = list(zeroth_ratio_sibs[0].descendants)[2].attrs['data-odd']
>>> first_item
'1.16'
>>> second_item = zeroth_ratio_sibs[1].attrs['data-odd']
>>> second_item
'7.50'
>>> third_item = zeroth_ratio_sibs[2].attrs['data-odd']
>>> third_item
'14.67'
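
The output above also suggests why searching for the table-matches__odds cells and reading their text returns nothing: the cells contain no text, and the numbers live in data-odd attributes, either on the td or on a nested span. As a rough sketch of a more compact variant, the following works on a static copy of the row taken from that output. It is not the method used above; odds_for_row and ROW_HTML are hypothetical names, the content of the first cell is left empty because it is not shown in the output, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# One row rebuilt from the sibling cells printed in the session above.
# The h-text-center cell is left empty because its content is not shown there.
ROW_HTML = """
<table><tr>
<td class="h-text-center"></td>
<td class="table-matches__odds colored"><span><span><span data-odd="1.16"></span></span></span></td>
<td class="table-matches__odds" data-odd="7.50"></td>
<td class="table-matches__odds" data-odd="14.67"></td>
<td class="h-text-right h-text-no-wrap">21.05.2017</td>
</tr></table>
"""


def odds_for_row(row):
    """Return the data-odd values found in one table row, in document order."""
    odds = []
    for cell in row.find_all("td", class_="table-matches__odds"):
        # The value sits either on the td itself or on a nested span.
        if cell.has_attr("data-odd"):
            holder = cell
        else:
            holder = cell.find(attrs={"data-odd": True})
        if holder is not None:
            odds.append(holder["data-odd"])
    return odds


soup = BeautifulSoup(ROW_HTML, "html.parser")
first_row = soup.find("tr")
odds = odds_for_row(first_row)
print(odds)  # ['1.16', '7.50', '14.67']

# The cells themselves carry no text, which is why reading .text yields nothing.
texts = [td.get_text() for td in first_row.find_all("td", class_="table-matches__odds")]
print(texts)  # ['', '', '']
```

The values come back as strings; wrapping them in float() would give numbers, assuming every data-odd value on the live page is numeric.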
