Correct headers and payload for scraping a website that uses ajax

I am trying to simulate an ajax request with scrapy FormRequest to get the next page of this website: https://www.the-academy.nl/trainingen. My headers look like this:

headers = {
        'path': 'https://www.the-academy.nl/Page?$$ajaxid=view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:tblView',
        'authority': 'www.the-academy.nl',
        'accept-encoding': 'gzip, deflate, br',
        'content-length': '1225',
        'content-type': 'multipart/form-data'
    }

and the form data looks like this:

formdata = {
        '$$viewid': '!1rjej6ewgse3x0h6r86gfzlst!',
        '$$xspsubmitid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager__Group__lnk__1',
        '$$xspexecid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager',
        '$$xspsubmitvalue':'',
        '$$xspsubmitscroll': '0|1272',
    }

I get a response, but it is a 404 page. Thanks in advance :)

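  1. I used java as the search term, and I kept only the form data entries that are real key-value pairs.

  2. Don't inject the 'content-length' header.

  3. Add method: POST.

  4. Call FormRequest.from_response.

  5. Below is an example that returns a 200 response status.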
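Point 2 can be illustrated without Scrapy. The sketch below is only an illustration, using the header values from the question: `content-length` is computed by the HTTP client from the body it actually sends, and `path` / `authority` are HTTP/2 pseudo-headers (`:path`, `:authority`) that browser dev tools display but that are not ordinary request headers. The helper name `clean_headers` and the `CLIENT_MANAGED` set are hypothetical, and the set is an assumption rather than an exhaustive rule.

```python
from urllib.parse import urlencode

# Header names the HTTP client derives on its own. This set is an assumption
# made for illustration; it is not an exhaustive or official list.
CLIENT_MANAGED = {"content-length", "host", "path", "authority", "method", "scheme"}


def clean_headers(headers):
    """Return a copy of headers without the client-managed entries.

    Names are compared case-insensitively, and a leading ':' (as shown for
    HTTP/2 pseudo-headers in browser dev tools) is ignored.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower().lstrip(":") not in CLIENT_MANAGED
    }


# Headers as copied in the question.
copied = {
    "path": "https://www.the-academy.nl/Page?$$ajaxid=view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:tblView",
    "authority": "www.the-academy.nl",
    "accept-encoding": "gzip, deflate, br",
    "content-length": "1225",
    "content-type": "multipart/form-data",
}

# Two of the form fields from the question, encoded the way a URL-encoded
# form body would be.
formdata = {"$$xspsubmitvalue": "", "$$xspsubmitscroll": "0|1272"}
body = urlencode(formdata)

cleaned = clean_headers(copied)
print(cleaned)
# The real body length depends on what is sent, so a hard-coded value
# such as 1225 will rarely match.
print(len(body.encode("utf-8")))
```

Once the form data changes, a copied `content-length` no longer describes the body, which is one plausible reason a server rejects the request.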

Script:

import scrapy
from scrapy.crawler import CrawlerProcess


class AspSpider(scrapy.Spider):
    name = 'asp'
    
    def start_requests(self):
        yield scrapy.FormRequest(
          
            url='https://www.the-academy.nl/zoekresultatenpagina?text=java',
            formdata= {
                'view:_id1:_id2:_id3:_id4:_id5:2:_id86:_id88:query': "",
                'view:_id1:_id2:_id3:_id4:_id5:3:_id94:_id96:query': "",
                '$$viewid': '!eaie1cfxpuckx0dbjrxsxrw60!',
                '$$xspsubmitid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager__Next',
                '$$xspexecid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager',
                '$$xspsubmitscroll': '0|1500',
                'view:_id1': 'view:_id1',
                '$$xspsubmitvalue': ""
                },
            callback=self.parse_item,
            headers={
                'accept': '*/*',
                'accept-encoding': 'gzip, deflate, br',
                'accept-language': 'en-US,en;q=0.9',
                'content-type': 'multipart/form-data; boundary=----WebKitFormBoundary2aCYMIdAcbwx4FjO',
                'referer': 'https://www.the-academy.nl/zoekresultatenpagina?text=java'
            },
            method='POST',
        )

    def parse_item(self, response):
        pass
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(AspSpider)  # pass the spider class to crawl(), not to the constructor
    process.start()

Output:

 DEBUG: Crawled (200) <POST https://www3.hkexnews.hk/sdw/search/searchsdw.aspx> (referer: https://www.the-academy.nl/zoekresultatenpagina?text=java)