How to simulate an XHR request with Scrapy when trying to crawl data from an AJAX-based website?
This is my first time using Scrapy to crawl a webpage, and unfortunately I picked a dynamic one to start with...
I have already managed to crawl part of it (120 links), thanks to someone who helped me here, but not all of the links on the target website.
After some research, I learned that crawling an AJAX page is, in principle, no different from crawling a plain one:
•Open the browser developer tools, Network tab
•Go to the target website
•Click the submit button and see the XHR request that is sent to the server
•Simulate this XHR request in your spider
The last step sounds obscure to me, though: how do I simulate an XHR request?
I have seen people simulate it with parameters like 'headers' or 'formdata', but I don't understand what that means.
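For reference, this is roughly the kind of snippet I have seen people post (the URL, form fields and header below are placeholders for illustration, not the real values of any particular site):

import scrapy
from scrapy.http import FormRequest

class XhrExampleSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only
    name = "xhr_example"

    def start_requests(self):
        # formdata becomes the POST body -- the same payload you see
        # under the XHR request in the browser's Network tab
        yield FormRequest(
            url="https://example.com/ajax/endpoint",          # placeholder URL
            formdata={"start": "0", "num": "60"},              # placeholder form fields
            headers={"X-Requested-With": "XMLHttpRequest"},    # header many sites use to mark XHR calls
            callback=self.parse_ajax,
        )

    def parse_ajax(self, response):
        # The response body is whatever the endpoint returns (HTML or JSON)
        self.logger.info("Received %d bytes", len(response.body))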
Here is part of my code:
class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def start_request(self, response):
        for i in range(0, 10):
            yield FormRequest(url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0", method="POST", formdata={'start': str(i+60), 'num': '60', 'numChildren': '0', 'ipf': '1', 'xhr': '1', 'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'}, callback=self.parse)

    def parse(self, response):
        links = response.xpath("//a/@href").extract()
        crawledLinks = []
        LinkPattern = re.compile("^/store/apps/details\?id=.")
        for link in links:
            if LinkPattern.match(link) and not link in crawledLinks:
                crawledLinks.append("http://play.google.com" + link + "#release")
        for link in crawledLinks:
            yield scrapy.Request(link, callback=self.parse_every_app)

    def parse_every_app(self, response):
The start_request method doesn't seem to do anything here. If I delete it, the spider still crawls the same number of links.
I have been struggling with this problem for a week... I would really appreciate it if you could help me...
Try this (note that your start_request was never called: Scrapy looks for a method named start_requests(self) with no response argument, so your spider just fell back to the default requests generated from start_urls):
import scrapy
from scrapy.http import FormRequest
# googleAppItem is defined in the project's items.py (a sketch is shown below)

class googleAppSpider(scrapy.Spider):
    name = "googleApp"
    allowed_domains = ['play.google.com']
    start_urls = ['https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0']

    def parse(self, response):
        # Replay the XHR request the page sends when it loads more results:
        # one POST per page, 60 results per page.
        for i in range(0, 10):
            yield FormRequest(
                url="https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0",
                method="POST",
                formdata={'start': str(i*60), 'num': '60', 'numChildren': '0',
                          'ipf': '1', 'xhr': '1',
                          'token': 'm1VdlomIcpZYfkJT5dktVuqLw2k:1455483261011'},
                callback=self.data_parse)

    def data_parse(self, response):
        # Collect every app detail link once and yield it as an item.
        seen = {}
        links = response.xpath("//a/@href").re(r'/store/apps/details.*')
        for l in links:
            if l not in seen:
                seen[l] = True
                item = googleAppItem()
                item['url'] = l
                yield item
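The code above assumes an item class along these lines in the project's items.py (a minimal sketch with only the field the spider actually fills in):

import scrapy

class googleAppItem(scrapy.Item):
    # URL of one app detail page
    url = scrapy.Field()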
Run the spider with
scrapy crawl googleApp -o links.csv
or
scrapy crawl googleApp -o links.json
and you will get all the links in a csv or json file. To increase the number of pages to crawl, change the range of the for loop, as in the example below.
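For example, since each POST requests 60 results ('num': '60', 'start': i*60), widening the range to 20 requests twice as many pages:

    # in parse(): crawl 20 pages instead of 10 (20 * 60 = 1200 results requested)
    for i in range(0, 20):
        ...  # same FormRequest yield as above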