Correct headers and payload for scraping a website that uses AJAX
I am trying to simulate an AJAX request with Scrapy's FormRequest to get the next page of this site: https://www.the-academy.nl/trainingen. My headers look like this:
headers = {
    'path': 'https://www.the-academy.nl/Page?$$ajaxid=view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:tblView',
    'authority': 'www.the-academy.nl',
    'accept-encoding': 'gzip, deflate, br',
    'content-length': '1225',
    'content-type': 'multipart/form-data'
}
and the form data looks like this:
formdata = {
    '$$viewid': '!1rjej6ewgse3x0h6r86gfzlst!',
    '$$xspsubmitid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager__Group__lnk__1',
    '$$xspexecid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager',
    '$$xspsubmitvalue': '',
    '$$xspsubmitscroll': '0|1272',
}
I get a response, but it is a 404 page.
Thanks in advance :)
I used java as the search term and kept only the form data fields that have key-value pairs.
Don't inject the 'content-length' header.
Add method: POST.
Call FormRequest.from_response.
Below is an example that returns a 200 response status.
Script:
from scrapy.crawler import CrawlerProcess
import scrapy


class AspSpider(scrapy.Spider):
    name = 'asp'

    def start_requests(self):
        # POST the pager form back to the search results page
        yield scrapy.FormRequest(
            url='https://www.the-academy.nl/zoekresultatenpagina?text=java',
            formdata={
                'view:_id1:_id2:_id3:_id4:_id5:2:_id86:_id88:query': "",
                'view:_id1:_id2:_id3:_id4:_id5:3:_id94:_id96:query': "",
                '$viewid': '!eaie1cfxpuckx0dbjrxsxrw60!',
                '$$xspsubmitid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager__Next',
                '$$xspexecid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager',
                '$$xspsubmitscroll': '0|1500',
                'view:_id1': 'view:_id1',
                '$$xspsubmitvalue': ""
            },
            callback=self.parse_item,
            # No 'content-length' here
            headers={
                'accept': '*/*',
                'accept-encoding': 'gzip, deflate, br',
                'accept-language': 'en-US,en;q=0.9',
                'content-type': 'multipart/form-data; boundary=----WebKitFormBoundary2aCYMIdAcbwx4FjO',
                'referer': 'https://www.the-academy.nl/zoekresultatenpagina?text=java'
            },
            method='POST'
        )

    def parse_item(self, response):
        pass


if __name__ == "__main__":
    # CrawlerProcess takes optional settings; the spider class goes to crawl()
    process = CrawlerProcess()
    process.crawl(AspSpider)
    process.start()
Output:
DEBUG: Crawled (200) <POST https://www3.hkexnews.hk/sdw/search/searchsdw.aspx> (referer: https://www.the-academy.nl/zoekresultatenpagina?text=java)
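On the 'content-length' tip: the value has to match the number of bytes actually sent, and that number changes whenever a field value changes, so a hard-coded '1225' copied from the browser is unlikely to stay correct. The sketch below uses only the standard library and assumes a URL-encoded body, which is what FormRequest produces from formdata by default. The helper name encoded_length is made up for this illustration, and the field values are the ones from the question.

```python
from urllib.parse import urlencode


def encoded_length(formdata: dict) -> int:
    """Return the byte length of the URL-encoded form body."""
    return len(urlencode(formdata).encode("utf-8"))


first_page = {
    "$$viewid": "!1rjej6ewgse3x0h6r86gfzlst!",
    "$$xspsubmitscroll": "0|1272",
}
# Same form, different scroll position: the body gets shorter.
shorter = dict(first_page, **{"$$xspsubmitscroll": "0|0"})

print(encoded_length(first_page))
print(encoded_length(shorter))
```

Since the two bodies differ in length, a fixed header value cannot be right for both, which is one reason to leave that header out and let the HTTP layer compute it.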