The POST request below is not working in Scrapy
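I have tried setting headers, cookies, formdata, and body, but I still get 401 and 500 status codes. On this site, the first page is fetched with GET and returns HTML, while the following pages are fetched with POST and return JSON. The status codes suggest the request is unauthorized, but I searched the page's headers and could not find any CSRF token or authorization token.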
import scrapy
from SouthShore.items import Product
from scrapy.http import Request, FormRequest


class OjcommerceDSpider(scrapy.Spider):
    handle_httpstatus_list = [401, 500]
    name = "ojcommerce_d"
    allowed_domains = ["ojcommerce.com"]
    #start_urls = ['http://www.ojcommerce.com/search?k=south%20shore%20furniture']

    def start_requests(self):
        return [FormRequest('http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
                            method="POST",
                            body='''{"searchTitle" : "south shore furniture","pageIndex" : '2',"sortBy":"1"}''',
                            headers={'Content-Type': 'application/json; charset=UTF-8', 'Accept': 'application/json, text/javascript, */*; q=0.01',
                                     'Cookie': '''vid=eAZZP6XwbmybjpTWQCLS+g==;
_ga=GA1.2.1154881264.1480509732;
ASP.NET_SessionId=rkklowbpaxzpp50btpira1yp'''},
                            callback=self.parse)]

    def parse(self, response):
        with open("ojcommerce.json", "wb") as f:
            f.write(response.body)
I got it working with the code below:
import json

from scrapy import Request, Spider


class OjcommerceDSpider(Spider):
    name = "ojcommerce"
    allowed_domains = ["ojcommerce.com"]
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        # Log the cookies sent and received with each request/response.
        'COOKIES_DEBUG': True,
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        },
    }

    def start_requests(self):
        # Fetch the regular search page first so the site sets its cookies.
        yield Request(
            url='http://www.ojcommerce.com/search?k=furniture',
            callback=self.parse_search_page,
        )

    def parse_search_page(self, response):
        # The cookies from the previous response are sent automatically.
        yield Request(
            url='http://www.ojcommerce.com/ajax/search.aspx/FetchDataforPaging',
            method='POST',
            body=json.dumps({'searchTitle': 'furniture', 'pageIndex': '2', 'sortBy': '1'}),
            callback=self.parse_json_page,
            headers={
                'Content-Type': 'application/json; charset=UTF-8',
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'X-Requested-With': 'XMLHttpRequest',
            },
        )

    def parse_json_page(self, response):
        data = json.loads(response.body)
        # Text mode: json.dump writes str, which fails on a 'wb' file in Python 3.
        with open('ojcommerce.json', 'w') as f:
            json.dump(data, f, indent=4)
Two observations:
- A prior request to another page of the site is needed to get a "fresh" ASP.NET_SessionId cookie.
- I couldn't get it to work with FormRequest; use Request instead.
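One more difference between the two spiders is worth checking: the question writes the request body by hand, while the answer builds it with json.dumps. The sketch below uses only the standard library to compare the two body strings. The helper name is_valid_json is made up for illustration, and whether the malformed body actually contributed to the 500 response is an assumption, not something confirmed in this thread.

```python
import json

# Body string copied from the question: note the single-quoted '2'.
question_body = '''{"searchTitle" : "south shore furniture","pageIndex" : '2',"sortBy":"1"}'''

# Body built the way the answer does it, from a Python dict.
answer_body = json.dumps({'searchTitle': 'furniture', 'pageIndex': '2', 'sortBy': '1'})


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json(question_body))  # False: JSON strings must use double quotes
print(is_valid_json(answer_body))    # True
```

Building the body from a dict avoids this kind of quoting mistake, which is easy to miss in a hand-written string.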