BeautifulSoup: Scraping different data sets that share the same set of attributes in the source code

Question

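I am using the BeautifulSoup module to scrape the total follower count and the total tweet count from a Twitter account. However, when I inspected the elements for these fields on the web page, I found that both are enclosed in the same set of HTML attributes:

Followers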

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>

Tweets

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>

The scraping script I wrote:

import requests
import urllib2
from bs4 import BeautifulSoup

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
followers = ''
for e in res.findAll('span', {'data-is-compact':'true'}):
    followers = e.text

print followers

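However, because both values, the total tweet count and the total follower count, are enclosed in the same set of HTML attributes, namely a span tag with class = "ProfileNav-value" and data-is-compact = "true", running the script above returns only the total follower count.

How can I extract two sets of information that are enclosed in similar HTML attributes with BeautifulSoup?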

Answer 1

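In this case, one way to do it is to check that data-is-compact="true" appears only twice, once for each piece of data you want to extract. You also know that tweets comes first and followers second, so you can keep a list of the titles in that same order and use zip to pair each title with its value in a tuple, printing both at the same time. For example: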

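import urllib2
from bs4 import BeautifulSoup

# Labels in the order the values appear on the page: tweets first, then followers.
profile = ['Tweets', 'Followers']

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
# Name the parser explicitly so the result does not depend on which parsers are installed.
res = BeautifulSoup(src, 'html.parser')
# Pair each label with the matching span, relying on document order.
for p, d in zip(profile, res.find_all('span', {'data-is-compact': 'true'})):
    print p, d.text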

It yields:

Tweets 21,8K
Followers 2,47M
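
If relying on element order feels fragile, a possible alternative is to key each value by the data-nav attribute of its parent <a> tag, which the markup quoted in the question sets to "tweets" and "followers". The sketch below is only an illustration under that assumption: it parses a shortened inline copy of the question's markup instead of fetching the live page, and the names html, stats and exact are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup quoted in the question (class lists trimmed).
html = """
<a class="ProfileNav-stat" data-nav="tweets" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
<a class="ProfileNav-stat" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

stats = {}  # compact value as displayed, keyed by data-nav
exact = {}  # full count parsed from data-original-title, keyed by data-nav

# Visit every <a> that carries a data-nav attribute, whatever its position.
for link in soup.find_all("a", attrs={"data-nav": True}):
    value = link.find("span", class_="ProfileNav-value")
    if value is None:
        continue
    key = link["data-nav"]
    stats[key] = value.get_text(strip=True)
    # The tooltip text looks like "21,769 Tweets"; keep the number only.
    title = link.get("data-original-title", "")
    if title:
        exact[key] = int(title.split()[0].replace(",", ""))

print("Tweets: %s" % stats["tweets"])
print("Followers: %s" % stats["followers"])
```

Because each value is looked up by name, the result does not change if the page lists the stats in a different order or adds further ProfileNav-value spans.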

Tags: python, beautifulsoup, web-scraping, python-2.7, python-requests