Scrapy: 403 Error on all requests

My Scrapy spider uses random proxies and works fine on my own computer. But when I run it on a VPS, it returns a 403 error on every request.

2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.29:2716>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.173:5195>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.93:3410>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden

I manually tested the proxies in Firefox on the VPS, and I can access the site through them without any problem.

These are my settings, the same ones I use on my computer:

DOWNLOADER_MIDDLEWARES = {
    # 'monitor.middlewares.MonitorDownloaderMiddleware': 543,
    # Proxies
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Proxies end
    # Useragent
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
    # Useragent end
}

# Random useragent list
USER_AGENT_LIST = r"C:\Users\Administrator\Desktop\useragents.txt"

# Retry many times since proxies often fail
RETRY_TIMES = 5
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
PROXY_LIST = r"C:\Users\Administrator\Desktop\proxies.txt"

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

Not sure what exactly is going wrong, but a lot of people seem to run into problems with scrapy_proxies. I'm using scrapy-rotating-proxies instead. It's maintained by kmike, who also maintains the Scrapy framework itself, so I figure it's the better-maintained option.
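A minimal sketch of what the switch looks like in settings.py, following the scrapy-rotating-proxies README (the proxy file path is just a placeholder; point it at your own list):

# settings.py -- sketch for scrapy-rotating-proxies (pip install scrapy-rotating-proxies)

# One proxy per line, e.g. http://host:port
ROTATING_PROXY_LIST_PATH = '/path/to/proxies.txt'  # placeholder path, adjust for your VPS

DOWNLOADER_MIDDLEWARES = {
    # These two middlewares replace scrapy_proxies.RandomProxy
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

The package also keeps track of dead and banned proxies and rotates around them automatically.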

Sometimes you get a 403 because robots.txt forbids bots on the whole site, or on the part of it you are scraping.

So first of all, put ROBOTSTXT_OBEY = False in your settings.py. I don't see it in the settings you posted here.
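For example (recent Scrapy project templates generate settings.py with this setting enabled, so you flip it off):

# settings.py
ROBOTSTXT_OBEY = False  # don't filter requests based on the site's robots.txt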

Ignoring robots.txt is generally not enough on its own. You also have to set your user agent to that of a regular browser, again in settings.py, for example: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'. Even better is to keep a whole list of user agents in your settings, e.g.:

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    ...,
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]

You seem to have done that. Then make it random, which you also appear to be doing.
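For reference, if you ever want to drop the third-party user-agent package, here is a minimal sketch of a downloader middleware that does the randomizing itself (the middlewares.py module path and class name are examples, not part of your project):

# middlewares.py -- sketch of a hand-rolled random user agent middleware
import random

class RandomUserAgentMiddleware:
    """Set a random User-Agent from USER_AGENT_LIST on every request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Assumes USER_AGENT_LIST is a Python list in settings.py,
        # not a path to a text file as in the question
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

# settings.py
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
#     'myproject.middlewares.RandomUserAgentMiddleware': 400,  # example path
# }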

Finally, this one is optional, but see if it helps you: add DOWNLOAD_DELAY = 3 to settings.py, with a value of at least 1. Ideally, make the delay random as well. It makes your spider behave more like a browser. As far as I know, downloading too fast tells the site that this is a bot hiding behind a fake user agent. An experienced webmaster will have put plenty of obstacles in place to protect his site from bots.
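Scrapy can randomize the delay for you; a small settings.py sketch using only built-in settings (AutoThrottle is optional):

# settings.py -- throttle requests so the spider looks less like a bot
DOWNLOAD_DELAY = 3                 # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # default: actual delay varies between 0.5x and 1.5x DOWNLOAD_DELAY

# Optional: let AutoThrottle adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3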

I tested this in my scrapy shell this morning while solving the same problem you have. Hope it helps.