Python

Question

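I am trying to scrape coinmarketcap.com with BeautifulSoup (I know there is an API; for training purposes I want to use BeautifulSoup). Every piece of information I have scraped so far was easy to select, but now I would like to get the "Holders Statistics", which look like this:

[image: holder stats]

My test code for selecting the specific div that contains the desired information looks like this: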

import requests
from bs4 import BeautifulSoup

url = 'https://coinmarketcap.com/currencies/bitcoin/holders/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
holders = soup.select('div', class_='n0m7sa-0 kkBhMM')
print(holders)

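The output of print(holders) is not the expected content of the div, but the entire HTML content of the website. I attached an image because the output is too long to paste.

[image: Output Code]

Does anyone know why this is happening?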

Answer 1

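Use .select() when you want to pass a CSS selector. In holders = soup.select('div', class_='n0m7sa-0 kkBhMM'), the class_ argument is essentially ignored, so the call matches every <div> regardless of its class. To target a specific class, either use .find_all() or change your .select() call: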

holders = soup.select('div.n0m7sa-0.kkBhMM')

or

holders = soup.find_all('div', class_='n0m7sa-0 kkBhMM')
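
The difference is easy to check offline. The sketch below runs the calls against a small made-up HTML snippet (not the real page; only the class names are borrowed from the question). It uses soup.select('div') as a stand-in for the original call, on the assumption stated above that the class_ keyword has no effect on .select().

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; only the class names come from the question.
html = """
<div class="n0m7sa-0 kkBhMM">holders</div>
<div class="other">something else</div>
<div>no class</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare tag selector matches every <div>, whatever its class.
every_div = soup.select("div")

# Classes written into the CSS selector itself do restrict the match.
by_selector = soup.select("div.n0m7sa-0.kkBhMM")

# find_all() accepts class_ and compares it with the class attribute.
by_find_all = soup.find_all("div", class_="n0m7sa-0 kkBhMM")

print(len(every_div), len(by_selector), len(by_find_all))
```

On a static document like this one, both corrected calls narrow the result to the single matching <div>.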

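With either of the two corrected calls, however, you will get back an empty list on this site. That is because those class attributes are not in the source HTML that requests receives. The site is dynamic, so these classes are generated after the initial request. You therefore either need to render the page first with Selenium and then pull the HTML, or check whether there is an API that serves the data directly.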

There is an API that returns the data:

import requests
import pandas as pd

# The same query parameters are sent to both endpoints
alpha = ['count', 'ratio']
payload = {
    'id': '1',
    'range': '7d'}

for each in alpha:
    url = f'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/detail/holders/{each}'
    jsonData = requests.get(url, params=payload).json()['data']['points']

    if each == 'count':
        # One row per timestamp with a single address-count column
        count_df = pd.DataFrame.from_dict(jsonData, orient='index')
        count_df = count_df.rename(columns={0: 'Total Addresses'})
    else:
        # Ratio columns, joined to the counts on the timestamp index
        ratio_df = pd.DataFrame.from_dict(jsonData, orient='index')
        df = count_df.merge(ratio_df, how='left', left_index=True, right_index=True)

df = df.sort_index()

Output:

print(df.to_string())
                      Total Addresses  topTenHolderRatio  topTwentyHolderRatio  topFiftyHolderRatio  topHundredHolderRatio
2021-11-24T00:00:00Z         39279627               5.25                  7.19                10.51                  13.26
2021-11-25T00:00:00Z         39255811               5.25                  7.19                10.49                  13.22
2021-11-26T00:00:00Z         39339840               5.25                  7.19                10.51                  13.24
2021-11-27T00:00:00Z         39391849               5.23                  7.11                10.45                  13.18
2021-11-28T00:00:00Z         39505340               5.24                  7.11                10.45                  13.18
2021-11-29T00:00:00Z         39502099               5.24                  7.11                10.43                  13.16
2021-11-30T00:00:00Z         39523000               5.24                  7.11                10.38                  13.12
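
For reference, requests appends the params dictionary to the URL as a query string. The sketch below uses only the standard library to build what should be the equivalent URLs for the two endpoints, which can be handy for testing them in a browser. The variable names base and urls are illustrative only.

```python
from urllib.parse import urlencode

# Endpoint prefix and query parameters taken from the answer above.
base = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/detail/holders"
payload = {"id": "1", "range": "7d"}

# One URL per endpoint name, with the payload encoded as a query string.
urls = [f"{base}/{name}?{urlencode(payload)}" for name in ("count", "ratio")]

for u in urls:
    print(u)
```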

Your other option: the data is also embedded as JSON inside a <script> tag. So you can pull it out of the initial page request this way as well:

from bs4 import BeautifulSoup
import requests
import json
import re
import pandas as pd

url = 'https://coinmarketcap.com/currencies/bitcoin/holders/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search(r"({.*})", jsonStr).groups()[0]
jsonData = json.loads(jsonStr)['props']['initialProps']['pageProps']['info']['holders']

df = pd.DataFrame(jsonData).drop('holderList', axis=1).drop_duplicates()

Output:

print(df.to_string())
   holderCount  dailyActive  topTenHolderRatio  topTwentyHolderRatio  topFiftyHolderRatio  topHundredHolderRatio
0     39523000       963625               5.24                  7.11                10.38                  13.12
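
The chain of keys used above raises a KeyError as soon as the site changes the layout of its embedded JSON. A minimal sketch of a more forgiving lookup follows; dig is a hypothetical helper, and the sample payload is a made-up, heavily trimmed stand-in for the real __NEXT_DATA__ content.

```python
import json


def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Made-up sample that only mimics the key path used in the answer.
sample = json.loads(
    '{"props": {"initialProps": {"pageProps": {"info": '
    '{"holders": {"holderCount": 39523000}}}}}}'
)

holders = dig(sample, "props", "initialProps", "pageProps", "info", "holders")
missing = dig(sample, "props", "pageProps", "info")  # path that does not exist

print(holders)
print(missing)
```

Checking the result for None before building the DataFrame makes a layout change show up as a clear message rather than a traceback.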

The social statistics in the project info come from a separate API endpoint:

import requests
import pandas as pd

url = 'https://api.coinmarketcap.com/data-api/v3/project-info/detail?slug=bitcoin'
jsonData = requests.get(url).json()
socialStats = jsonData['data']['socialStats']

row = {}
for k, v in socialStats.items():
    if isinstance(v, dict):
        row.update(v)
    else:
        row.update({k:v})
        
df = pd.DataFrame([row])

Output:

print(df.to_string())
   cryptoId commits contributors  stars  forks watchers              lastCommitAt  members               updatedTime
0         1   31588          836  59687  30692     3881  2021-11-30T00:09:02.000Z  3617460  2021-11-30T16:00:02.365Z
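
The loop above flattens one level of nesting. If the response ever nests deeper, a recursive version may be more convenient. This is a sketch only: flatten is a hypothetical helper, and the sample dictionary (including the github and reddit group names) is made up to resemble the shape the loop implies.

```python
def flatten(nested):
    """Merge all nested dict values into a single flat dict."""
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            # Recurse, dropping the group name and keeping the inner keys.
            flat.update(flatten(value))
        else:
            flat[key] = value
    return flat


# Made-up sample; the group names are assumptions, not the real API layout.
sample = {
    "cryptoId": 1,
    "github": {"commits": 31588, "stars": 59687},
    "reddit": {"members": 3617460},
}

row = flatten(sample)
print(row)
```

As with the original loop, inner keys that share a name overwrite each other, so the last one encountered wins.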

Answer 2

As chitown88 already mentioned, the content is served dynamically, so an alternative solution is Selenium.

How to select?

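If you fetch the HTML with Selenium, you should avoid selecting elements by their classes, because those are generated dynamically as well.

Select the <h3> that contains the text "Holders Statistics", then, inside the <div> elements that follow it, every descendant that is not itself a <span>: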

soup.select('h3:-soup-contains("Holders Statistics") ~ div :not(span)')

Example based on Selenium 4

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
options = webdriver.ChromeOptions()
service = ChromeService(executable_path='YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://coinmarketcap.com/currencies/bitcoin/holders/')

soup = BeautifulSoup(driver.page_source,'html.parser')

data = {x.get_text('|').split('|')[0]:x.get_text('|').split('|')[1] for x in soup.select('h3:-soup-contains("Holders Statistics") ~ div :not(span)')}
print(data)

driver.close()

Output

{'Total Addresses': '39,523,000',
 'Active Addresses': '24h',
 'Top 10 Holders': '5.24%',
 'Top 20 Holders': '7.11%',
 'Top 50 Holders': '10.38%',
 'Top 100 Holders': '13.12%'}
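
The scraped values are strings. If numbers are needed, they can be converted afterwards. The sketch below uses a hypothetical to_number helper on a few of the values shown above; it keeps percentages as percentage points and returns None for anything it cannot parse, such as '24h'.

```python
def to_number(text):
    """Convert '39,523,000' to an int and '5.24%' to a float; else None."""
    cleaned = text.replace(",", "").strip()
    if cleaned.endswith("%"):
        return float(cleaned[:-1])
    try:
        return int(cleaned)
    except ValueError:
        return None


# A few entries copied from the output above.
scraped = {
    "Total Addresses": "39,523,000",
    "Active Addresses": "24h",
    "Top 10 Holders": "5.24%",
}

parsed = {label: to_number(value) for label, value in scraped.items()}
print(parsed)
```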

Python - BeautifulSoup - Selecting a 'div' with 'class'-attribute shows every div in the html

beautifulsoup

web-crawler