How to identify the crucial information that a request needs to send?
I want to fetch some fares from this website, which uses an autocomplete request.
Here is my code:
import scrapy
from scrapy.http import Request, FormRequest
import urllib


class CabforceSpider(scrapy.Spider):
    name = 'cabforce'
    start_urls = ['https://www.cabforce.com']
    complete_url = 'https://www.cabforce.com/v1/geo/autocomplete'

    def parse(self, response):
        payload = {
            'chnl': 'cforce',
            'complete': 'Barcelona Airport',
            'destination': 'Barcelona'
        }
        return Request(
            self.complete_url,
            self.print_json,
            method='POST',
            body=urllib.urlencode(payload),
            headers={'X-Requested-With': 'XMLHttpRequest'})

    def print_json(self, response):
        print response.body
Unfortunately, this is the response I get:
{"status":"ArgumentError","reason":"Cannot validate input","description":null,"reasonType":2000,"details":[]}
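How can I find out which information is missing and needs to be sent with the request? I thought about the JSESSIONID and the version, but I don't know how to proceed.
Thanks for any hints, and have a nice day!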
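The form may contain hidden data that you are not submitting. Use a FormRequest object instead of a plain Request. When it is built with FormRequest.from_response, the request is pre-filled with the fields of the form found in the response, and you only override the fields you want to change.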
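You don't even need to send cookies with the request. The problem is this line:
body=urllib.urlencode(payload),
It URL-encodes the body as form data, but if you look at the body of the request the browser sends, you will see that it is JSON.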
So the solution is to import json and change the line mentioned above to this one:
body=json.dumps(payload),
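To make the difference between the two encodings concrete, here is a minimal standalone sketch that builds both bodies from the payload in the question. It uses only the standard library and Python 3 (where `urllib.urlencode` lives in `urllib.parse`); the helper names `form_encoded_body` and `json_body` are made up for illustration and are not part of Scrapy.

```python
import json
from urllib.parse import urlencode  # Python 3 location of urllib.urlencode

# Payload copied from the spider in the question.
payload = {
    'chnl': 'cforce',
    'complete': 'Barcelona Airport',
    'destination': 'Barcelona',
}


def form_encoded_body(data):
    """Hypothetical helper: the form-encoded body the original spider sent."""
    return urlencode(data)


def json_body(data):
    """Hypothetical helper: the JSON body that the fix sends instead."""
    return json.dumps(data)


if __name__ == '__main__':
    print(form_encoded_body(payload))
    # chnl=cforce&complete=Barcelona+Airport&destination=Barcelona
    print(json_body(payload))
    # {"chnl": "cforce", "complete": "Barcelona Airport", "destination": "Barcelona"}
```

Comparing these strings with the request body shown in the browser's network tab is a quick way to see which format an endpoint expects.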
With that change, your spider gives me the following result:
{"status":"Ok","result":{"autocomplete":{"elements":[{"type":16,"description":"(BCN) - Barcelona Airport, Barcelona, Spain","location":{"lat":41.289545,"lng":2.072639},"raw":{"name":"(BCN) - Barcelona Airport","city":"Barcelona","country":"Spain"}},{"location":{"lat":41.3181887517739,"lng":2.07441323388724},"description":"Barcelona Airport Hotel, Plaza Volatería, 3, El Prat de Llobregat, Spain","raw":{"name":"Barcelona Airport Hotel","city":"El Prat de Llobregat","country":"Spain"},"type":4},{"location":{"lat":41.3176275,"lng":2.0249774},"description":"Airport Barcelona Apartments, Rafael Casanova, 37, Viladecans, Spain","raw":{"name":"Airport Barcelona Apartments","city":"Viladecans","country":"Spain"},"type":4}]}}}