How to work with data from NBA.com?
I found Greg Reda's blog post about scraping HTML from nba.com:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
I tried to use the code he wrote there:
import requests
import json
url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
'=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
response = requests.get(url)
response.raise_for_status()
shots = response.json()['resultSets']['rowSet']
avg_percentage = shots['OPP_FG_PCT']
print(avg_percentage)
But it returns:
Traceback (most recent call last):
  File "C:\Python34\nba.py", line 91, in <module>
    avg_percentage = shots['OPP_FG_PCT']
TypeError: list indices must be integers, not str
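I know only basic Python, so I couldn't figure out how to get the list of integers from the data. Can anybody explain?
Evidently the data structure has changed since Greg Reda wrote his post. Before exploring the data, I recommend saving it to a file by pickling it. That way you don't have to hit the NBA server and wait for a download every time you modify and rerun your script.
The following script checks whether the pickled data already exists, to avoid unnecessary downloads: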
import os
import pickle

import requests

url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
      'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
      'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
      '=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
      'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
      'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
print(url)

file_name = 'result_sets.pickled'

if os.path.isfile(file_name):
    # Reuse the cached copy instead of contacting the NBA server again.
    with open(file_name, 'rb') as f:
        result_sets = pickle.load(f)
else:
    # First run: download the data, then cache it for later runs.
    response = requests.get(url)
    response.raise_for_status()
    result_sets = response.json()['resultSets']
    with open(file_name, 'wb') as f:
        pickle.dump(result_sets, f)

print(result_sets.keys())
print(result_sets['headers'][1])
print(result_sets['rowSet'][0])
print(len(result_sets['rowSet']))
Once you have result_sets in hand, you can examine the data. If you print it, you'll see that it's a dictionary. You can extract the dictionary keys:
print(result_sets.keys())
Currently the keys are 'headers', 'rowSet', and 'name'. You can inspect the headers:
print(result_sets['headers'])
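I probably know less than you do about these statistics. However, by looking at the data, I've been able to tell that result_sets['rowSet'] contains 30 rows of 23 elements each. The 23 columns are identified by result_sets['headers'][1]. Try this: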
print(result_sets['headers'][1])
That will show you the 23 column names. Now take a look at the first row of team data:
print(result_sets['rowSet'][0])
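Now you can see the 23 values reported for the Atlanta Hawks. You can iterate over the rows in result_sets['rowSet'] to extract whatever values interest you and to compute aggregate information such as totals and averages.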
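As a rough sketch of that last step, here is one way to pull a single column out of the rows and average it. Each row is a plain list, which is why indexing it with a string such as 'OPP_FG_PCT' raises the TypeError from the question; you need the integer position of the column instead. The data below is a made-up stand-in so the snippet runs without a download, and I'm assuming the column names sit under a 'columnNames' key in result_sets['headers'][1]. Print that entry on your own data to confirm before relying on it. The helper name column_values is mine, not part of any library.

```python
from statistics import mean

# Hypothetical stand-in for the downloaded result_sets. The values are
# invented, and the 'columnNames' key is an assumption about the layout.
result_sets = {
    'name': 'ShotLocations',
    'headers': [
        {'name': 'SHOT_CATEGORY', 'columnNames': ['Restricted Area']},
        {'name': 'columns',
         'columnNames': ['TEAM_ID', 'TEAM_NAME', 'OPP_FGM', 'OPP_FGA', 'OPP_FG_PCT']},
    ],
    'rowSet': [
        [1, 'Team A', 10.0, 20.0, 0.5],
        [2, 'Team B', 12.0, 20.0, 0.6],
        [3, 'Team C', 14.0, 20.0, 0.7],
    ],
}


def column_values(result_sets, column_name):
    """Return the values of one named column, one entry per row."""
    names = result_sets['headers'][1]['columnNames']
    # list.index gives the integer position of the first matching name.
    index = names.index(column_name)
    return [row[index] for row in result_sets['rowSet']]


percentages = column_values(result_sets, 'OPP_FG_PCT')
average = mean(percentages)
print(percentages)
print(average)
```

Note that list.index only finds the first match, so if a column name appears more than once in the real headers, you would need to pick the position you want by hand.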