Problem with scraping JSON data from a website

Question

I'm trying to scrape this website to get the data in the table: https://investor.vanguard.com/etf/profile/overview/ESGV/portfolio-holdings

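I inspected the site and found that the table data comes from JSON loaded through an external link. Here is my code, which tries to target that link using headers and a payload: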

import pandas as pd
import requests
import scraper_helper

headers = """ XXX """
headers = scraper_helper.get_dict(headers,strip_cookie=False)

url = 'https://api.vanguard.com/rs/ire/01/ind/fund/4393/portfolio-holding/stock.jsonp'
payload = {
    'callback': 'angular.callbacks._m',
    'planId': 'null',
    'asOfType': 'daily',
    'start': '1',
    'count': '1527',
}

jsonData = requests.get(url, params=payload).json()
results = jsonData['fund']['entity']

df2 = pd.json_normalize(results, record_path=['portfolioHolding'])
df2 = pd.DataFrame(df2,index=list(range(len(df2))))
print(df2)

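When I open the link manually in a browser, an error appears: "We're sorry. The page you requested could not be found." That is usually not a problem: I have scraped several sites where the JSON data link shows an error in the browser but still works from Python. This time, however, the error also appears in Python, and for some reason I can't get around it.

How can I solve this? Thanks!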

Answer 1

Their endpoint appears to require the Referer header to be set to https://investor.vanguard.com/.

Try this:

requests.get(url, params=payload, headers={ 'Referer': 'https://investor.vanguard.com/' }).text

Note that the response isn't quite JSON: the JSON is wrapped in angular.callbacks._m( … ).
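Because of that wrapper, calling .json() on the response would likely fail. One possible way to get at the data is to strip the callback call before parsing. This is only a minimal sketch: unwrap_jsonp is a hypothetical helper name, and the sample string is made up to imitate the shape of the wrapper, not taken from the real API.

```python
import json
from typing import Any


def unwrap_jsonp(text: str) -> Any:
    """Return the parsed JSON payload from a JSONP string like callback({...})."""
    start = text.index("(")   # the first "(" opens the callback call
    end = text.rindex(")")    # the last ")" closes it
    return json.loads(text[start + 1:end])


# Made-up sample shaped like the wrapper described above; not real API data.
sample = 'angular.callbacks._m({"fund": {"entity": [{"ticker": "AAA"}]}});'
data = unwrap_jsonp(sample)
print(data["fund"]["entity"])
```

With the request above, the same helper could be applied to the .text of the response, and the resulting dict passed on to pd.json_normalize as in the question.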

python

json

web-scraping

python-requests