Using Leafproxies proxy for scraping, ValueError: Port could not be cast to integer value
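I have been a Scrapy enthusiast for three months. Because I enjoy scraping so much, I ended up, frustrated but excited, buying a proxy package from Leafproxies.
Unfortunately, when I loaded the proxies into my Scrapy spider, I got a ValueError.
I used scrapy-rotating-proxies to integrate the proxies. I added them as string URLs rather than numbers, as shown below: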
ROTATING_PROXY_LIST = [
"us-retail-fast.resdleafproxies.com:5000:ksre9jXXXXXXXXI38HJg5:XXX9nh",
"us-retail-fast.resdleafproxies.com:5000:ksre9jvXXXXXXXXk+zHtjyZRG:XXXXtf9nh",
# ...
]
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
'rotating_proxies.middlewares.BanDetectionMiddleware': 800
}
Scrapy log:
draco@draco:~/docs/scraping/scrapyyy/thomas$ scrapy crawl home2 -o all_np4.csv
/home/draco/.local/lib/python3.8/site-packages/scrapy/spiderloader.py:37: UserWarning: There are several spiders with the same name:
HomeSpider named 'home' (in thomas.spiders.home)
HomeSpider named 'home' (in thomas.spiders.home3)
This can cause unexpected behavior.
warnings.warn(
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: thomas)
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-30-generic-x86_64-with-glibc2.29
2022-02-21 00:16:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-21 00:16:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'thomas',
'CLOSESPIDER_ERRORCOUNT': 10,
'CONCURRENT_REQUESTS': 3,
'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
'CONCURRENT_REQUESTS_PER_IP': 5,
'COOKIES_ENABLED': False,
'DNS_TIMEOUT': 10,
'DOWNLOAD_DELAY': 2,
'DOWNLOAD_TIMEOUT': 200,
'NEWSPIDER_MODULE': 'thomas.spiders',
'SPIDER_MODULES': ['thomas.spiders']}
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet Password: 536c802b585074b3
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'rotating_proxies.middlewares.RotatingProxyMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'thomas.middlewares.UserAgentRotatorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'thomas.middlewares.ThomasSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-21 00:16:51 [scrapy.core.engine] INFO: Spider opened
2022-02-21 00:16:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-21 00:16:51 [home2] INFO: Spider opened: home2
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-21 00:16:51 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 30, reanimated: 0, mean backoff time: 0s)
INITIAL REQUEST
OPENING LIST https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden
OPENING LIST https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau
OPENING LIST https://www.homegate.ch/buy/apartment/canton-zurich/matching-list
2022-02-21 00:16:51 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006:XXXXXj: XXXXXXXtf9nh> is DEAD
#....
2022-02-21 00:17:02 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 2 times, max retries: 5)
esdleafproxies.com:5005:ksre9jva95etajxxaoll9k+cw17qdyl:xxxx9nh> is DEAD
2022-02-21 00:17:21 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> with another proxy (failed 5 times, max retries: 5)
2022-02-21 00:17:23 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXjxxaoll9k+ZcGvdwJf:XXXXXXXtf9nh> is DEAD
2022-02-21 00:17:23 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 5 times, max retries: 5)
2022-02-21 00:17:25 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproXXXXXXXsre9jva95etajxxaoll9k+oFx6kEXE:xxxxxxxtf9nh> is DEAD
2022-02-21 00:17:25 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden> (failed 6 times with different proxies)
OPENING LIST https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400
2022-02-21 00:17:25 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5007:ksre9jva95etajxxaoll9k+oFx6kEXE:XXXXtf9nh'
2022-02-21 00:17:28 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006xxxxxxxxetajxxaoll9k+V2UowimU:XXXXXXf9nh> is DEAD
2022-02-21 00:17:28 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> (failed 6 times with different proxies)
2022-02-21 00:17:28 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5006:ksre9jva95etajxxaoll9k+XXXXXX'
2022-02-21 00:17:30 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5004:XXXXXXX5etajxxaoll9k+fbg56Ioj:XXXXf9nh> is DEAD
2022-02-21 00:17:30 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> (failed 6 times with different proxies)
2022-02-21 00:17:30 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5004:XXXXXva95etajxxaoll9k+fbg56Ioj:XXXXXtf9nh'
2022-02-21 00:17:31 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2022-02-21 00:17:33 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5000:XXXXXajxxaoll9k+zHtjyZRG:XXXX9nh> is DEAD
2022-02-21 00:17:33 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 1 times, max retries: 5)
2022-02-21 00:17:36 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXXXXetajxxaoll9k+uSsCeYH5:lXXXXXXmtf9nh> is DEAD
2022-02-21 00:17:36 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 2 times, max retries: 5)
ValueError: Port could not be cast to integer value as '5009:ksre9jva95etajxxaoll9k+HOggeKA3:XXXXXh'
2022-02-21 00:17:47 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-21 00:17:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/builtins.ValueError': 24,
'downloader/exception_count': 24,
'downloader/exception_type_count/builtins.ValueError': 24,
'downloader/request_bytes': 7158,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'elapsed_time_seconds': 55.895942,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 20, 21, 17, 47, 135433),
'log_count/DEBUG': 50,
'log_count/ERROR': 4,
'log_count/INFO': 13,
'memusage/max': 65073152,
'memusage/startup': 65073152,
'proxies/dead': 21,
'proxies/mean_backoff': 196.90260209397636,
'proxies/reanimated': 1,
'proxies/unchecked': 9,
'scheduler/dequeued': 24,
'scheduler/dequeued/memory': 24,
'scheduler/enqueued': 24,
'scheduler/enqueued/memory': 24,
'start_time': datetime.datetime(2022, 2, 20, 21, 16, 51, 239491)}
2022-02-21 00:17:47 [scrapy.core.engine] INFO: Spider closed (finished)
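What could the problem be?
My proxy membership with Leafproxies is "Residential Proxies". Leafproxies does not provide any details about it or how to use it. As far as I understand, there is no real customer support, only a Discord channel.
This is the panel Leafproxies provides. I took the proxies from the list shown in it. No data usage has been recorded.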
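The way you define the proxy list is incorrect. You need to use the format username:password@server:port rather than server:port:username:password. With the latter, everything after the first colon that follows the host is treated as the port, which is why the error message shows values such as '5007:ksre9j...'. Try the following definition: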
ROTATING_PROXY_LIST= [
"https://ksre9jva95etajxxaoll9k+JI38HJg5:lnztmtf9nh@us-retail-fast.resdleafproxies.com:5000",
"https://ksre9jva95etajxxaoll9k+zHtjyZRG:lnztmtf9nh@us-retail-fast.resdleafproxies.com:5001",
]
DOWNLOADER_MIDDLEWARES = {
# ...
'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
'rotating_proxies.middlewares.BanDetectionMiddleware': 810,
# ...
}
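If the provider's panel exports entries as server:port:username:password, the list can be converted rather than retyped. The following is a minimal sketch, not part of scrapy-rotating-proxies: the helper name `to_proxy_url` and the host `proxy.example.com` are hypothetical, the `https` default simply mirrors the scheme used in the definition above (whether the provider expects `http` instead is worth checking), and it assumes the credentials contain no characters such as `@` or `/` that would need escaping. It also reproduces the original ValueError with the standard library to show where it comes from.

```python
from urllib.parse import urlsplit


def to_proxy_url(entry, scheme="https"):
    """Convert 'server:port:username:password' to 'scheme://username:password@server:port'.

    Hypothetical helper for illustration only.
    """
    # maxsplit=3 keeps any extra colons inside the password field.
    host, port, username, password = entry.split(":", 3)
    int(port)  # fail early with ValueError if the port is not numeric
    return f"{scheme}://{username}:{password}@{host}:{port}"


# Placeholder entry in the provider's export format (not real credentials).
raw_entries = ["proxy.example.com:5000:user+abc:secret"]
ROTATING_PROXY_LIST = [to_proxy_url(entry) for entry in raw_entries]

# The converted URL now parses with a numeric port.
parsed = urlsplit(ROTATING_PROXY_LIST[0])
print(parsed.hostname, parsed.port, parsed.username)

# The unconverted form fails the same way as in the traceback:
# everything after the first colon following the host is taken as the port.
try:
    urlsplit("http://proxy.example.com:5000:user+abc:secret").port
except ValueError as exc:
    print("ValueError:", exc)
```

Keeping the raw entries in a separate file or environment variable, instead of in settings.py, would also avoid pasting credentials into a public post.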
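Note:
You have exposed your credentials on the internet, so anyone who sees this question can use your proxy service for free. Consider revoking the credentials as soon as possible.
A second problem you may face is that the site you are scraping may already have banned some of the proxies, so you will receive failed responses. You therefore need to increase the value of RETRIES when using proxies.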
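As a rough illustration of the retry advice: in a Scrapy project the retry count is, to my knowledge, governed by Scrapy's `RETRY_TIMES` setting and, for scrapy-rotating-proxies, by `ROTATING_PROXY_PAGE_RETRY_TIMES` (the "max retries: 5" in the log above matches its default). The values below are arbitrary examples for settings.py, not recommendations.

```python
# settings.py (illustrative values only)

# Scrapy's built-in RetryMiddleware: retries per request after the first attempt.
RETRY_TIMES = 5

# scrapy-rotating-proxies: how many different proxies to try for one page
# before giving up on it.
ROTATING_PROXY_PAGE_RETRY_TIMES = 10

# scrapy-rotating-proxies: base and cap (in seconds) of the backoff applied
# before a dead proxy is re-checked.
ROTATING_PROXY_BACKOFF_BASE = 300
ROTATING_PROXY_BACKOFF_CAP = 3600
```

Raising the retry count only helps once the proxy URLs parse correctly; with the malformed list every attempt fails with the same ValueError regardless of how many retries are allowed.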