Scrapyd - URL parsing problem when passed as a spider argument
I added the following code to my Spider class so that I can pass a URL as an argument:
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.start_urls = [kwargs.get('target_url').replace('\\', '')]
(The replace call strips the backslashes that shell escaping introduces.)
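For context, a minimal sketch of how that __init__ fits into a complete spider; the parse callback below is a placeholder, since the real parsing logic is not shown in the post:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Strip the backslashes that shell escaping introduces
        self.start_urls = [kwargs.get('target_url').replace('\\', '')]

    def parse(self, response):
        # Placeholder callback; the actual parsing logic is not shown here
        self.logger.info('Crawled %s', response.url)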
The spider recognizes the URL, starts parsing, and shuts down cleanly when I run it locally:
scrapy crawl my_spider -a target_url="https://www.example.com/list.htm\?tri\=initial\&enterprise\=0\&idtypebien\=2,1\&pxMax\=1000000\&idtt\=2,5\&naturebien\=1,2,4\&ci\=910377"
However, when I do the same thing through Scrapyd, running:
curl https://my_spider.herokuapp.com/schedule.json -d project=default -d spider=my_spider -d target_url="https://www.example.com/list.htm\?tri\=initial\&enterprise\=0\&idtypebien\=2,1\&pxMax\=1000000\&idtt\=2,5\&naturebien\=1,2,4\&ci\=910377"
I get an error, because the URL is parsed differently from how scrapy crawl parses it.
Log:
2019-08-08 22:52:34 [scrapy.core.engine] INFO: Spider opened
2019-08-08 22:52:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-08 22:52:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-08 22:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/list.htm?tri=initial> (referer: http://www.example.com)
2019-08-08 22:52:34 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-08 22:52:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 267,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 35684,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.680357,
After some experimenting, I found that for some reason, when the URL is passed as a spider argument through Scrapyd, it stops being parsed at the first & character.
Any insight on how to remedy this behavior?
I managed to solve my problem. The culprit was the way cURL sends POST requests, not Scrapyd.
After inspecting this request:
curl -v http://example.herokuapp.com/schedule.json -d project=default -d spider=my_spider -d target_url="https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377" --trace-ascii /dev/stdout
I got:
Warning: --trace-ascii overrides an earlier trace/verbose option
== Info: Trying 52.45.74.184...
== Info: TCP_NODELAY set
== Info: Connected to example.herokuapp.com (52.45.74.184) port 80 (#0)
=> Send header, 177 bytes (0xb1)
0000: POST /schedule.json HTTP/1.1
001e: Host: example.herokuapp.com
0043: User-Agent: curl/7.54.0
005c: Accept: */*
0069: Content-Length: 164
007e: Content-Type: application/x-www-form-urlencoded
00af:
=> Send data, 164 bytes (0xa4)
0000: project=default&spider=example&target_url=https://www.example.co
0040: m/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000
0080: &idtt=2,5&naturebien=1,2,4&ci=910377
== Info: upload completely sent off: 164 out of 164 bytes
Apparently, since the POST request was effectively sent as:
http://example.herokuapp.com/schedule.json?project=default&spider=example&target_url=https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
every & was treated as the start of a new parameter. So the only part of the URL that ended up in the target_url parameter was https://www.example.com/list.htm?tri=initial, and the rest was interpreted as additional parameters of the POST request.
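You can reproduce what the server sees by feeding that body to a standard form parser; here is a minimal sketch using Python's urllib.parse, with the body string reassembled from the trace above:

from urllib.parse import parse_qs

# Body exactly as cURL sent it (reassembled from the trace above)
body = ('project=default&spider=example&target_url=https://www.example.com'
        '/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000'
        '&idtt=2,5&naturebien=1,2,4&ci=910377')

# A form parser splits the body at every '&', so target_url is cut short
print(parse_qs(body)['target_url'])
# ['https://www.example.com/list.htm?tri=initial']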
After switching to Postman and trying the following POST request:
POST /schedule.json HTTP/1.1
Host: example.herokuapp.com
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW
cache-control: no-cache
Postman-Token: 004990ad-8f83-4208-8d36-529376b79643

------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="project"

default
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="spider"

my_spider
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="target_url"

https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
------WebKitFormBoundary7MA4YWxkTrZu0gW--
it worked: the job launched successfully on Scrapyd!
With cURL, using -F instead of -d works just as well:
curl https://example.herokuapp.com/schedule.json -F project=default -F spider=my_spider -F target_url="https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377"
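-F switches the body to multipart/form-data, where each field travels as its own part, so the & characters inside target_url no longer collide with the form encoding. Two other options worth knowing: cURL's --data-urlencode flag percent-encodes the value while keeping a -d-style body, and any HTTP client that encodes form values for you works too. A minimal sketch with the third-party requests library, using the same endpoint and field names as above:

import requests

# requests percent-encodes each form value, so every '&' inside target_url
# reaches Scrapyd as part of the URL instead of splitting the body
resp = requests.post(
    'https://example.herokuapp.com/schedule.json',
    data={
        'project': 'default',
        'spider': 'my_spider',
        'target_url': ('https://www.example.com/list.htm?tri=initial'
                       '&enterprise=0&idtypebien=2,1&pxMax=1000000'
                       '&idtt=2,5&naturebien=1,2,4&ci=910377'),
    },
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}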