Scraping data through a paginated table using Python
I am scraping data from Google Finance's stock history page (http://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=PLfUVIDTDuSRiQKhwYGQBQ).
I can scrape the 30 rows shown on the current page. The problem I am facing is that I cannot scrape the rest of the data in the table (rows 31-241). How do I go to the next page or link?
Below is my code:
import urllib2
import xlwt  # to write into excel spreadsheet
from bs4 import BeautifulSoup

# Main Coding Section

stock_links = open('stock_link_list.txt', 'r')  # opening text file for reading
#url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=zHXOVLPnApG2iALxxYCADQ"

for url in stock_links:
    OurFile = urllib2.urlopen(url)
    OurHtml = OurFile.read()
    OurFile.close()
    soup = BeautifulSoup(OurHtml)

    #soup1 = soup.find("div", {"class": "gf-table-wrapper sfe-break-bottom-16"}).get_text()
    soup1 = soup.find("table", {"class": "gf-table historical_price"}).get_text()

    end = url.index('&')
    filename = url[47:end]
    file = open(filename, 'w')  # opening text file for writing
    file.write(soup1)
    #file.write(soup1.get_text())  # writing to the text file
    file.close()  # closing the text file
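A side note on the filename logic above: slicing the URL at a fixed offset (url[47:end]) breaks as soon as the URL prefix changes length. A minimal sketch of deriving the name from the q= query parameter instead; the helper name filename_for is my own, not part of the original code.

try:
    from urlparse import urlparse, parse_qs       # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qs   # Python 3

def filename_for(url):
    # q=NSE%3ASIEMENS decodes to "NSE:SIEMENS"; replace ":" so the
    # result is a valid filename on every platform
    query = parse_qs(urlparse(url).query)
    ticker = query.get('q', ['unknown'])[0]
    return ticker.replace(':', '_') + '.txt'

print(filename_for("https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=zHXOVLPnApG2iALxxYCADQ"))
# -> NSE_SIEMENS.txt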
You will have to fine-tune it, and I would catch more specific errors, but you can keep increasing start
to get the next batch of data:
from bs4 import BeautifulSoup
import requests

# Main Coding Section

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=W8LUVLHnAoOswAOFs4DACg&start={}&num=30"

start = 0
while True:
    try:
        nxt = url.format(start)
        r = requests.get(nxt)
        soup = BeautifulSoup(r.content, "html.parser")
        print(soup.find("table", {"class": "gf-table historical_price"}).get_text())
    except Exception as e:
        print(e)
        break
    start += 30
This fetches all of the table data, back to Feb 7:
......
Date
Open
High
Low
Close
Volume
Feb 7, 2014
552.60
557.90
548.25
551.50
119,711
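A possible follow-up on the same paging loop: get_text() flattens every cell onto its own line, as the output above shows, so here is a sketch that parses each TR into cells and writes them to a CSV file instead. The table class comes from the answer above; the output filename, the header-row skipping, and the html.parser argument are illustrative choices of mine.

import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=W8LUVLHnAoOswAOFs4DACg&start={}&num=30"

with open('NSE_SIEMENS.csv', 'w', newline='') as f:  # Python 3; use mode 'wb' on Python 2
    writer = csv.writer(f)
    start = 0
    while True:
        r = requests.get(url.format(start))
        soup = BeautifulSoup(r.content, 'html.parser')
        table = soup.find('table', {'class': 'gf-table historical_price'})
        if table is None:  # no table on this page -> we are past the last page
            break
        for i, tr in enumerate(table.find_all('tr')):
            if start > 0 and i == 0:
                continue  # the header row repeats on every page; keep it only once
            cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
            if cells:
                writer.writerow(cells)
        start += 30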
At first glance, the Row Limit option suggests you can display at most 30 rows per page, but I manually changed the query-string parameters to larger values and realized that up to 200 rows can be viewed per page.
Change the URL to
https://www.google.com/finance/historical?q=NSE%3ASIEMENS&ei=OM3UVLFtkLnzBsjIgYAI&start=0&num=200
and it will show 200 rows.
Then change it to start=200&num=400.
What would be more logical, though, especially if you have many other kinds of links:
scrape the pagination area, take the last TR, grab the link to the next page, and scrape that (see the sketch below).
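A minimal sketch of that link-following idea. Everything about the pagination markup here is an assumption (in particular, that the next-page link is an anchor labeled "Next"); the table class is the one used in the answers above.

import requests
from bs4 import BeautifulSoup

try:
    from urlparse import urljoin       # Python 2
except ImportError:
    from urllib.parse import urljoin   # Python 3

url = "https://www.google.com/finance/historical?q=NSE%3ASIEMENS&start=0&num=30"
while url:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    table = soup.find('table', {'class': 'gf-table historical_price'})
    if table is None:
        break
    print(table.get_text())
    # Assumed markup: a "Next" anchor somewhere in the pagination area.
    # On the last page no such link exists, so the loop ends.
    nxt = soup.find('a', string='Next')
    url = urljoin(url, nxt['href']) if nxt else None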