What is wrong with my web scraper code (Python 3.4)?
I'm trying to scrape a table from a website. The script runs, but nothing gets written to my output file. Where did I go wrong?
Code:
from bs4 import BeautifulSoup
import urllib.request

f = open('nbapro.txt', 'w')
errorFile = open('nbaerror.txt', 'w')

page = urllib.request.urlopen('http://www.numberfire.com/nba/fantasy/full-fantasy-basketball-projections')
content = page.read()
soup = BeautifulSoup(content)
tableStats = soup.find('table', {'class': 'data-table xsmall'})
for row in tableStats.findAll('tr')[2:]:
    col = row.findAll('td')
    try:
        name = col[0].a.string.strip()
        f.write(name + '\n')
    except Exception as e:
        errorFile.write(str(e) + '******' + str(col) + '\n')
        pass

f.close()
errorFile.close()
The problem is that the table data you are trying to scrape is filled in by JavaScript code that runs in the browser. urllib is not a browser, so it cannot execute JavaScript.
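One way to confirm this kind of problem is to count the table rows in the raw markup, before any JavaScript has had a chance to run. The sketch below uses only the standard library's html.parser so it runs offline; SAMPLE_HTML is a made-up stand-in for what the server might send, not the real page, and RowCounter and count_rows are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical markup: an empty table plus a script that holds the data.
# The real page's structure may differ.
SAMPLE_HTML = """
<html><body>
<table class="data-table xsmall"><tbody></tbody></table>
<script>var NF_DATA = {"players": {}};</script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Count the <tr> start tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    """Return how many table rows exist without running any JavaScript."""
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


print(count_rows(SAMPLE_HTML))
```

If the count is zero for the downloaded page while the browser shows a full table, the rows are most likely being added client-side.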
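If you want to solve it with urllib and BeautifulSoup, you have to extract the JSON object from the script tag and load it via json.loads(). For example, to print the player names: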
import json
import re
import urllib.request

from bs4 import BeautifulSoup

url = 'http://www.numberfire.com/nba/fantasy/full-fantasy-basketball-projections'
soup = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')

# Find the <script> tag whose contents mention NF_DATA and take its raw text.
script = soup.find('script', text=lambda x: x and 'NF_DATA' in x).string
# Capture what is assigned to NF_DATA, up to the first semicolon.
data = re.search(r'NF_DATA = (.*?);', script).group(1)
data = json.loads(data)

for player_id, player in data['players'].items():
    print(player['name'] + ' ' + player['last_name'])
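The extraction step can be tried offline on a sample string. The sketch below is a variation on the approach above rather than the same code: it swaps the non-greedy regex capture for json.JSONDecoder().raw_decode, which parses one JSON value starting at a given index, so a semicolon inside a string value does not cut the data short. SAMPLE_SCRIPT, the "note" field, and the helper names are made up for illustration; only the NF_DATA, players, name and last_name keys come from the answer.

```python
import io
import json
import re

# Hypothetical script contents; the real page's data will differ.
SAMPLE_SCRIPT = (
    'var NF_DATA = {"players": {'
    '"1": {"name": "John", "last_name": "Smith", "note": "G;F"}, '
    '"2": {"name": "Jane", "last_name": "Doe", "note": "C"}}}; '
    'var other = 1;'
)


def extract_nf_data(script_text):
    """Return the object assigned to NF_DATA, or None if it is absent."""
    match = re.search(r'NF_DATA\s*=\s*', script_text)
    if match is None:
        return None
    # raw_decode reads one JSON value at the index and ignores the rest.
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


def player_names(data):
    """Build sorted 'first last' names from the players mapping."""
    return sorted(p['name'] + ' ' + p['last_name']
                  for p in data['players'].values())


def write_names(names, fileobj):
    """Write one name per line to an open text file object."""
    for name in names:
        fileobj.write(name + '\n')


names = player_names(extract_nf_data(SAMPLE_SCRIPT))
buffer = io.StringIO()  # stands in for open('nbapro.txt', 'w')
write_names(names, buffer)
print(buffer.getvalue())
```

To get the file output the question asked for, the same write_names call should work inside a with open('nbapro.txt', 'w') as f: block, which also closes the file when the block ends.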