Using Leafproxies proxy for scraping, ValueError: Port could not be cast to integer value

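I have been a Scrapy enthusiast for 3 months. Because I enjoy scraping so much, I ran into limits and, frustrated but excited, bought a proxy package from Leafproxies.

Unfortunately, when I loaded the proxies into my Scrapy spider, I got a ValueError.

I used scrapy-rotating-proxies to integrate the proxies. The entries I added are not numeric IP addresses but hostname strings, as shown below: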

ROTATING_PROXY_LIST = [
    "us-retail-fast.resdleafproxies.com:5000:ksre9jXXXXXXXXI38HJg5:XXX9nh",
    "us-retail-fast.resdleafproxies.com:5000:ksre9jvXXXXXXXXk+zHtjyZRG:XXXXtf9nh",
    # ...
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 800,
}

Scrapy log:

draco@draco:~/docs/scraping/scrapyyy/thomas$ scrapy crawl home2 -o all_np4.csv
/home/draco/.local/lib/python3.8/site-packages/scrapy/spiderloader.py:37: UserWarning: There are several spiders with the same name:

  HomeSpider named 'home' (in thomas.spiders.home)

  HomeSpider named 'home' (in thomas.spiders.home3)

  This can cause unexpected behavior.
  warnings.warn(
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: thomas)
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-30-generic-x86_64-with-glibc2.29
2022-02-21 00:16:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-21 00:16:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'thomas',
 'CLOSESPIDER_ERRORCOUNT': 10,
 'CONCURRENT_REQUESTS': 3,
 'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
 'CONCURRENT_REQUESTS_PER_IP': 5,
 'COOKIES_ENABLED': False,
 'DNS_TIMEOUT': 10,
 'DOWNLOAD_DELAY': 2,
 'DOWNLOAD_TIMEOUT': 200,
 'NEWSPIDER_MODULE': 'thomas.spiders',
 'SPIDER_MODULES': ['thomas.spiders']}
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet Password: 536c802b585074b3
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'rotating_proxies.middlewares.RotatingProxyMiddleware',
 'rotating_proxies.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'thomas.middlewares.UserAgentRotatorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'thomas.middlewares.ThomasSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-21 00:16:51 [scrapy.core.engine] INFO: Spider opened
2022-02-21 00:16:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-21 00:16:51 [home2] INFO: Spider opened: home2
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-21 00:16:51 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 30, reanimated: 0, mean backoff time: 0s)
INITIAL REQUEST
OPENING LIST https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden
OPENING LIST https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau
OPENING LIST https://www.homegate.ch/buy/apartment/canton-zurich/matching-list

    2022-02-21 00:16:51 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006:XXXXXj: XXXXXXXtf9nh> is DEAD
#....
    2022-02-21 00:17:02 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 2 times, max retries: 5)
  esdleafproxies.com:5005:ksre9jva95etajxxaoll9k+cw17qdyl:xxxx9nh> is DEAD
    2022-02-21 00:17:21 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> with another proxy (failed 5 times, max retries: 5)
    2022-02-21 00:17:23 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXjxxaoll9k+ZcGvdwJf:XXXXXXXtf9nh> is DEAD
    2022-02-21 00:17:23 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 5 times, max retries: 5)
    2022-02-21 00:17:25 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproXXXXXXXsre9jva95etajxxaoll9k+oFx6kEXE:xxxxxxxtf9nh> is DEAD
    2022-02-21 00:17:25 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden> (failed 6 times with different proxies)
    OPENING LIST https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400
    2022-02-21 00:17:25 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden>
    Traceback (most recent call last):
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
        result = current_context.run(
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        return (yield download_func(request=request, spider=spider))
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
        result = f(*args, **kw)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
        return handler.download_request(request, spider)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
        return agent.download_request(request)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
        agent = self._get_agent(request, timeout)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
        return _parsed_url_args(parsed)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        port = parsed.port
      File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
        raise ValueError(message) from None
    ValueError: Port could not be cast to integer value as '5007:ksre9jva95etajxxaoll9k+oFx6kEXE:XXXXtf9nh'
    2022-02-21 00:17:28 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006xxxxxxxxetajxxaoll9k+V2UowimU:XXXXXXf9nh> is DEAD
    2022-02-21 00:17:28 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> (failed 6 times with different proxies)
    2022-02-21 00:17:28 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau>
    Traceback (most recent call last):
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
        result = current_context.run(
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        return (yield download_func(request=request, spider=spider))
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
        result = f(*args, **kw)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
        return handler.download_request(request, spider)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
        return agent.download_request(request)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
        agent = self._get_agent(request, timeout)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
        return _parsed_url_args(parsed)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        port = parsed.port
      File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
        raise ValueError(message) from None
    ValueError: Port could not be cast to integer value as '5006:ksre9jva95etajxxaoll9k+XXXXXX'
    2022-02-21 00:17:30 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5004:XXXXXXX5etajxxaoll9k+fbg56Ioj:XXXXf9nh> is DEAD
    2022-02-21 00:17:30 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> (failed 6 times with different proxies)
    2022-02-21 00:17:30 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list>
    Traceback (most recent call last):
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
        result = current_context.run(
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        return (yield download_func(request=request, spider=spider))
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
        result = f(*args, **kw)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
        return handler.download_request(request, spider)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
        return agent.download_request(request)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
        agent = self._get_agent(request, timeout)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
        return _parsed_url_args(parsed)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        port = parsed.port
      File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
        raise ValueError(message) from None
    ValueError: Port could not be cast to integer value as '5004:XXXXXva95etajxxaoll9k+fbg56Ioj:XXXXXtf9nh'
    2022-02-21 00:17:31 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
    2022-02-21 00:17:33 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5000:XXXXXajxxaoll9k+zHtjyZRG:XXXX9nh> is DEAD
    2022-02-21 00:17:33 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 1 times, max retries: 5)
    2022-02-21 00:17:36 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXXXXetajxxaoll9k+uSsCeYH5:lXXXXXXmtf9nh> is DEAD
    2022-02-21 00:17:36 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 2 times, max retries: 5)
    
    
    ValueError: Port could not be cast to integer value as '5009:ksre9jva95etajxxaoll9k+HOggeKA3:XXXXXh'
    2022-02-21 00:17:47 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-02-21 00:17:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'bans/error/builtins.ValueError': 24,
     'downloader/exception_count': 24,
     'downloader/exception_type_count/builtins.ValueError': 24,
     'downloader/request_bytes': 7158,
     'downloader/request_count': 24,
     'downloader/request_method_count/GET': 24,
     'elapsed_time_seconds': 55.895942,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2022, 2, 20, 21, 17, 47, 135433),
     'log_count/DEBUG': 50,
     'log_count/ERROR': 4,
     'log_count/INFO': 13,
     'memusage/max': 65073152,
     'memusage/startup': 65073152,
     'proxies/dead': 21,
     'proxies/mean_backoff': 196.90260209397636,
     'proxies/reanimated': 1,
     'proxies/unchecked': 9,
     'scheduler/dequeued': 24,
     'scheduler/dequeued/memory': 24,
     'scheduler/enqueued': 24,
     'scheduler/enqueued/memory': 24,
     'start_time': datetime.datetime(2022, 2, 20, 21, 16, 51, 239491)}
    2022-02-21 00:17:47 [scrapy.core.engine] INFO: Spider closed (finished)

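What could the problem be?

My proxy membership at Leafproxies is "Residential Proxies". Leafproxies gives no details about the product or how to use it. As far as I can tell, there is no real customer support, only a Discord channel.

This is the panel that Leafproxies provides. I took the proxies from the list shown there. No data usage is recorded.

The way you define your proxy list is incorrect. You need to use the format username:password@server:port instead of server:port:username:password. With the latter, everything after the first colon following the host is treated as the port, which is why the traceback complains about a port such as '5007:ksre9jva95etajxxaoll9k+oFx6kEXE:XXXXtf9nh'. Try the following definition: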

ROTATING_PROXY_LIST = [
    "https://ksre9jva95etajxxaoll9k+JI38HJg5:lnztmtf9nh@us-retail-fast.resdleafproxies.com:5000",
    "https://ksre9jva95etajxxaoll9k+zHtjyZRG:lnztmtf9nh@us-retail-fast.resdleafproxies.com:5001",
]
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 810,
    # ...
}
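
If the panel exports many entries in the server:port:username:password layout, rewriting them by hand is error-prone. The following is a minimal sketch of a converter, together with a check that reproduces the parsing failure from the traceback. The helper name to_proxy_url, the placeholder host and credentials, and the default http scheme are assumptions made for illustration (pass scheme="https" to match the listing above). It also assumes the username contains no ':' or '@'.

```python
from urllib.parse import urlsplit


def to_proxy_url(entry: str, scheme: str = "http") -> str:
    """Convert 'server:port:username:password' to 'scheme://username:password@server:port'.

    Hypothetical helper, not part of Scrapy or scrapy-rotating-proxies.
    """
    # Split into exactly four fields; a password may itself contain ':'.
    server, port, username, password = entry.split(":", 3)
    int(port)  # fail early with ValueError if the port field is not numeric
    return f"{scheme}://{username}:{password}@{server}:{port}"


# Placeholder entry, not a real proxy or real credentials.
raw = "proxy.example.com:5000:user+session1:secret"

# The original layout: everything after the first ':' is taken as the port,
# so reading .port raises the ValueError seen in the log.
try:
    urlsplit("http://" + raw).port
    original_parses = True
except ValueError:
    original_parses = False

fixed = to_proxy_url(raw)
parsed = urlsplit(fixed)

print(original_parses)   # False
print(fixed)             # http://user+session1:secret@proxy.example.com:5000
print(parsed.hostname, parsed.port, parsed.username)
```

The converted strings can then be placed in ROTATING_PROXY_LIST, for example with a list comprehension over the raw entries.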

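Note: you have exposed your credentials on the internet, so anyone who sees this question can use your proxy service for free. Consider revoking the credentials as soon as possible.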
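
One way to avoid pasting credentials into settings that may later be shared is to read the proxy URLs from the environment. This is only a sketch: the variable name PROXY_URLS and the helper proxies_from_env are hypothetical, and the comma-separated layout is an assumption.

```python
import os


def proxies_from_env(var_name: str = "PROXY_URLS") -> list:
    """Return proxy URLs from a comma-separated environment variable.

    Hypothetical helper; returns an empty list if the variable is unset.
    """
    raw = os.environ.get(var_name, "")
    # Drop surrounding whitespace and skip empty items.
    return [item.strip() for item in raw.split(",") if item.strip()]


# In settings.py the list would then be built without hard-coded secrets.
ROTATING_PROXY_LIST = proxies_from_env()
```

With this in place, the settings file can be posted in a question without revealing the username or password.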

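A second problem you may run into is that the site you are scraping may already have banned some of the proxies, in which case you will get failed responses. When using proxies, you should therefore increase the number of retries (the value of RETRIES).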
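
As a rough sketch of what raising the retry count could look like in settings.py: scrapy-rotating-proxies reads ROTATING_PROXY_PAGE_RETRY_TIMES (the log above reports "max retries: 5"), and Scrapy's own RetryMiddleware reads RETRY_TIMES. Treating these two settings as the ones meant by RETRIES is an assumption, and the numbers are illustrative, not recommendations.

```python
# settings.py (sketch; values are illustrative assumptions)

# scrapy-rotating-proxies: how many times a page is retried with a
# different proxy before the middleware gives up on the request.
ROTATING_PROXY_PAGE_RETRY_TIMES = 10

# Scrapy's built-in RetryMiddleware: retries for failed downloads,
# in addition to the first attempt.
RETRY_TIMES = 5
```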