Scrapy POST request getting redirected to the wrong page
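I am trying to reach the details page of this site.
To get there in a browser:
1. Click Consula Titlulo.
2. Select ORO from the Minerals dropdown.
3. Click Buscar.
4. Click the first item in the list.
Dev tools and Fiddler show that I should make a POST request with the item ID as the payload, and that this POST request is then redirected to the details page.
In my case, I get redirected to the main page instead. What am I missing?
Here is my Scrapy spider.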
# -*- coding: utf-8 -*-
import scrapy
from scrapy.shell import inspect_response


class CodeSpider(scrapy.Spider):
    name = "col"
    start_urls = ['http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc']
    # Headers copied from the browser request
    headers = {
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Origin": "http://www.cmc.gov.co:8080",
        "Upgrade-Insecure-Requests": "1",
        "DNT": "1",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1, AppleWebKit/537.36 (KHTML, like Gecko, Chrome/68.0.3440.106 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8,uk;q=0.7",
    }

    def parse(self, response):
        # Open a shell on every response, including the one reached after the redirect
        inspect_response(response, self)
        # POST the item ID; in the browser this request redirects to the details page
        payload = {'expediente': '29', 'tipoSolicitud': ''}
        url = 'http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc'
        yield scrapy.FormRequest(url, formdata=payload, headers=self.headers,
                                 callback=self.parse, dont_filter=True)
Here is the log with the redirect.
2018-08-23 13:58:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> from <POST http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc>
2018-08-23 13:58:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> (referer: http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc)
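One way to see exactly what the server answers to the POST is to stop Scrapy from following the 302 and look at the raw response. The sketch below assumes the standard `dont_redirect` and `handle_httpstatus_list` request meta keys; `REDIRECT_META` and `describe_redirect` are hypothetical names used only for illustration.

```python
# Meta keys that keep the 302 response instead of following it.
REDIRECT_META = {
    "dont_redirect": True,            # RedirectMiddleware leaves the response alone
    "handle_httpstatus_list": [302],  # HttpErrorMiddleware passes the 302 to the callback
}


def describe_redirect(response):
    """Summarise a redirect response: status, target, and cookies set by the server.

    Expects a Scrapy-style response whose headers.get() returns bytes (or None)
    and whose headers.getlist() returns a list of bytes.
    """
    location = response.headers.get("Location")
    cookies = response.headers.getlist("Set-Cookie")
    return {
        "status": response.status,
        "location": location.decode() if location else None,
        "set_cookie": [c.decode() for c in cookies],
    }
```

In the spider, this would be used as `scrapy.FormRequest(url, formdata=payload, meta=REDIRECT_META, callback=...)`, with the callback logging `describe_redirect(response)`. A `Set-Cookie` on the 302 would suggest the server started a fresh session for this request.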
As far as I can tell, Scrapy also sets the correct cookie before sending the request.
In [2]: request.headers
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8,uk;q=0.7',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'PHPSESSID=1um6r67md5qpdcqs9g2n15g605',
'Dnt': '1',
'Origin': 'http://www.cmc.gov.co:8080',
'Referer': 'http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1, AppleWebKit/537.36 (KHTML, like Gecko, Chrome/68.0.3440.106 Safari/537.36'}
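The `Cookie` header above shows that a session cookie is sent, but not whether the server-side session behind it holds any search state. To watch cookies on every request and response, Scrapy's `COOKIES_DEBUG` setting can be turned on. A minimal sketch as a `settings.py` fragment (the same keys could go in the spider's `custom_settings` instead):

```python
# settings.py fragment
COOKIES_ENABLED = True  # the default; keeps the cookies middleware active
COOKIES_DEBUG = True    # log Cookie headers sent and Set-Cookie headers received
```

With this enabled, the log should show which `PHPSESSID` each request carries and whether the server replaces it along the way.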
What am I missing?
Also, if I use the Postman-generated code with a GET request for the details page, it works fine and returns the page.
The same code in Scrapy gets redirected.
In [1]: url = "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/detalleExpedienteTitulo.cmc"
   ...:
   ...: headers = {
   ...:     'upgrade-insecure-requests': "1",
   ...:     'user-agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
   ...:     'dnt': "1",
   ...:     'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
   ...:     'referer': "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc",
   ...:     'accept-encoding': "gzip, deflate",
   ...:     'accept-language': "en-US,en;q=0.9,ru;q=0.8,uk;q=0.7",
   ...:     'cookie': "PHPSESSID=2ba8dsre6l42un95qu33k09ud6",
   ...:     'cache-control': "no-cache",
   ...:
   ...: }
   ...:
In [2]: fetch(url, headers=headers)
2018-08-23 14:47:13 [scrapy.core.engine] INFO: Spider opened
2018-08-23 14:47:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> from <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/detalleExpedienteTitulo.cmc>
2018-08-23 14:47:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> (referer: http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc)
It looks like I was missing a POST request at the very beginning. That POST request generates the correct session ID, which is new for each search.