How to simulate an XHR request with Scrapy when trying to crawl data from an AJAX-based website?
This is my first time using Scrapy to crawl a webpage, and unfortunately I picked a dynamic one to start with...
I have already managed to crawl part of it (120 links), thanks to someone who helped me here, but not all of the links on the target website.
After some research, I learned that crawling an AJAX page is, in principle, no different from crawling a plain one:
•Open the browser developer tools, Network tab
•Go to the target website
•Click the submit button and see the XHR request that is sent to the server
•Simulate this XHR request in your spider
The last step sounds obscure to me, though: how do I simulate an XHR request?
I have seen people simulate it with parameters like 'headers' or 'formdata', but I don't understand what that means.
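For reference, this is roughly the kind of snippet I have seen people post (the URL, form fields and header below are placeholders for illustration, not the real values of any particular site):

import scrapy
from scrapy.http import FormRequest

class XhrExampleSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only
    name = "xhr_example"

    def start_requests(self):
        # formdata becomes the POST body -- the same payload you see
        # under the XHR request in the browser's Network tab
        yield FormRequest(
            url="https://example.com/ajax/endpoint",          # placeholder URL
            formdata={"start": "0", "num": "60"},              # placeholder form fields
            headers={"X-Requested-With": "XMLHttpRequest"},    # header many sites use to mark XHR calls
            callback=self.parse_ajax,
        )

    def parse_ajax(self, response):
        # The response body is whatever the endpoint returns (HTML or JSON)
        self.logger.info("Received %d bytes", len(response.body))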
Here is part of my code:
class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0", method="POST", formdata={'start': str(i+60), 'num': '60', 'numChildren': '0', 'ipf': '1', 'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'}, callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile("^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and not link in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
The start_request method doesn't seem to do anything here. If I delete it, the spider still crawls the same number of links.
I have been struggling with this problem for a week... I would really appreciate it if you could help me...
Try this (note that your start_request was never called: Scrapy looks for a method named start_requests(self) with no response argument, so your spider just fell back to the default requests generated from start_urls):
import scrapy
from scrapy.http import FormRequest
# googleAppItem is defined in the project's items.py (a sketch is shown below)

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        # Replay the XHR request the page sends when it loads more results:
        # one POST per page, 60 results per page.
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i*60), 'num': '60', 'numChildren': '0',
                          'ipf': '1', 'xhr': '1',
                          'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                callback=self.data_parse)

    def data_parse(self, response):
        # Collect every app detail link once and yield it as an item.
        seen = {}
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen[l] = True
                item = googleAppItem()
                item['url'] = l
                yield item
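The code above assumes an item class along these lines in the project's items.py (a minimal sketch with only the field the spider actually fills in):

import scrapy

class googleAppItem(scrapy.Item):
    # URL of one app detail page
    url = scrapy.Field()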
Run the spider with
scrapy crawl googleApp -o links.csv
or
scrapy crawl googleApp -o links.json
and you will get all the links in a csv or json file. To increase the number of pages to crawl, change the range of the for loop, as in the example below.
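For example, since each POST requests 60 results ('num': '60', 'start': i*60), widening the range to 20 requests twice as many pages:

    # in parse(): crawl 20 pages instead of 10 (20 * 60 = 1200 results requested)
    for i in range(0, 20):
        ...  # same FormRequest yield as above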