Scrapy adds %0A to URLs, causing them to fail
I'm just about at my wits' end with this one. Basically, I have a URL that seems to be cursed. Specifically:
https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031
When I hit it with requests, everything works fine:
import requests
test = requests.get("https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031")
<Response [200]>
However, when I use scrapy, the following line pops up:
Crawled (404) <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
I even tried updating my user-agent string, but to no avail. Part of me worries that the URL-encoded %0A is responsible, but that seems strange, and I can't find any documentation on how to fix it.
For reference, this is how I send the requests, though I'm not sure how much information that adds:
for url in review_urls:
    yield scrapy.Request(url, callback=self.get_review_urls)
It's important to note that this is the exception rather than the rule. Most URLs work unimpeded, but these edge cases are not uncommon.
I don't think this is a scrapy problem; I suspect something is wrong with your review_urls. Please see this demonstration from the scrapy shell: your URL ends with a newline character, which gets converted to %0A during URL encoding (docs here). It seems you accidentally added a newline to the end of the URL, or the extracted URLs contain extra newlines.
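As a quick illustration of the encoding behavior itself (a minimal sketch using only the standard library, independent of scrapy), a trailing newline in a URL string is percent-encoded to %0A:

```python
from urllib.parse import quote

# A URL with an accidental trailing newline, as might come from extraction
url = "https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n"

# Percent-encode it, keeping the usual URL delimiters intact
encoded = quote(url, safe=":/?&=")

print(encoded.endswith("%0A"))  # True: the newline became %0A
```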
scrapy shell 'https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031'
2015-08-02 05:48:56 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-08-02 05:48:56 [scrapy] INFO: Optional features available: ssl, http11
2015-08-02 05:48:56 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-08-02 05:48:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-08-02 05:48:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-02 05:48:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-02 05:48:56 [scrapy] INFO: Enabled item pipelines:
2015-08-02 05:48:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-02 05:48:56 [scrapy] INFO: Spider opened
2015-08-02 05:48:58 [scrapy] DEBUG: Redirecting (302) to <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> from <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
2015-08-02 05:48:59 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s] item {}
[s] request <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s] response <200 http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s] settings <scrapy.settings.Settings object at 0x7fe365b91c50>
[s] spider <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2015-08-02 05:48:59 [root] DEBUG: Using default logger
2015-08-02 05:48:59 [root] DEBUG: Using default logger
In [1]: url = 'https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n'
In [2]: fetch(url)
2015-08-02 05:49:24 [scrapy] DEBUG: Crawled (404) <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s] item {}
[s] request <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
[s] response <404 https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
[s] settings <scrapy.settings.Settings object at 0x7fe365b91c50>
[s] spider <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Calling strip() on the URL before making the request will give you the desired result, as shown below:
In [3]: fetch(url.strip())
2015-08-02 05:53:01 [scrapy] DEBUG: Redirecting (302) to <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> from <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
2015-08-02 05:53:03 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s] item {}
[s] request <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s] response <200 http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s] settings <scrapy.settings.Settings object at 0x7fe365b91c50>
[s] spider <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
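Applied to the original spider, the same fix means stripping each URL before yielding the request. A minimal sketch (the sample list below is hypothetical; in the spider you would call url.strip() inside the loop that yields scrapy.Request):

```python
# Hypothetical extracted URLs; one carries a stray trailing newline
review_urls = [
    "https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n",
    "https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031",
]

# Strip surrounding whitespace/newlines before building requests
clean_urls = [u.strip() for u in review_urls]

# In the spider this would become:
#   for url in review_urls:
#       yield scrapy.Request(url.strip(), callback=self.get_review_urls)

print(clean_urls[0] == clean_urls[1])  # True: both normalize to the same URL
```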