Scrapy and Incapsula

I'm trying to retrieve data from the website "whoscored.com" using Scrapy and Splash. Here are my settings:

BOT_NAME = 'scrapy_matchs'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_matchs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 20
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_matchs.pipelines.ScrapyMatchsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 30
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
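
(The spider itself isn't shown here; for context, this is a minimal sketch of how requests go through Splash with a setup like the one above. The spider name and the wait time are assumptions, not code from the question.)

import scrapy
from scrapy_splash import SplashRequest

class MatchsSpider(scrapy.Spider):
    name = 'matchs'  # hypothetical; the actual spider isn't shown above

    def start_requests(self):
        # Render the page through the local Splash instance configured in SPLASH_URL.
        yield SplashRequest('https://www.whoscored.com/',
                            callback=self.parse,
                            args={'wait': 5})

    def parse(self, response):
        self.logger.info(response.text[:200])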

Before this, I was using Splash on its own, and I could request at least 2 or 3 pages before being blocked by Incapsula. With Scrapy, though, I'm blocked right after the first request:

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript">
  </script>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=22&amp;xinfo=14-58014137-0%200NNN%20RT%281572446923864%2084%29%20q%280%20-1%20-1%202%29%20r%280%20-1%29%20B17%284%2c200%2c0%29%20U18&amp;incident_id=727001300034907080-167681622137047086&amp;edet=17&amp;cinfo=04000000&amp;rpinfo=0" width="100%">
   Request unsuccessful. Incapsula incident ID: 727001300034907080-167681622137047086
  </iframe>
 </body>
</html>

Why am I getting blocked so easily? Should I change my settings?

Thanks in advance.

Is it possible that they have your previous scraping activity logged, activity that Scrapy isn't responsible for?

USER_AGENT = 'scrapy_matchs (+http://www.yourdomain.com)'

That part also made me think of my own web server's log files, which contain URLs like github.com/masscan. If the domain is related to scraping, or if it contains the phrase scrapy, I don't feel bad about banning them. Definitely obey the robots.txt rules; being a bot that doesn't check it makes you look bad ;) I also wouldn't use that many user agents. And I like the idea of grabbing the default headers a browser sends the site and putting them in place of your own. If my site were being hammered by crawler traffic, I could imagine filtering out users whose request headers look odd/off.
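
For example, a minimal sketch of what that could look like in settings.py. The values below are placeholders; capture the headers your own browser actually sends to the site (e.g. from the Network tab of its developer tools) and mirror those instead:

# settings.py - mirror a real browser session (placeholder values).
ROBOTSTXT_OBEY = True  # per the advice above: do check robots.txt

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',  # browsers never send 'none'
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
}

# One single, current, real browser UA instead of a pool of outdated ones.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36')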

I would suggest that you...

  1. Nmap scan the site to find out which web server they use.
  2. Install and set that server up on your local machine with the most basic settings. (Turn on all of its logging options; most servers ship with some turned off.)
  3. Check that server's log files and compare your scraping traffic against your browser connecting to the site.
  4. Then work out how to make the former look exactly like the latter.
  5. If none of that mitigates the problem, don't use scrapy; just use selenium with a real user agent and automate browsing the site the way a user would, running your scraping code over the pages (see the selenium sketch after this list).
  6. I would also suggest using a different IP, via a proxy or some other method, because it looks like your IP may already be on a ban list somewhere (a proxy sketch follows below).
  7. The AWS free tier would be an easy way to check the site's security: if it lets you connect to the site through an ssh proxy port set up on the machine you use to connect to the AWS server, then they haven't banned the AWS server you're using, which I'd call a lack of security on their part, since basically every AWS server on Earth seems to scan my Pi daily (a quick tunnel check is sketched below as well).
  8. Doing this work at a library next to a Starbucks would be nice... free wifi and a different IP address.
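
For point 5, a minimal selenium sketch, assuming Chrome and chromedriver are installed; the user-agent string is a placeholder for the one your real browser reports:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder UA - copy the exact string from your own browser.
options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/78.0.3904.97 Safari/537.36')

driver = webdriver.Chrome(options=options)
driver.get('https://www.whoscored.com/')
html = driver.page_source  # hand the rendered page to your parsing code
driver.quit()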
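
For point 6, a sketch of routing Scrapy requests through a proxy; the spider name and proxy URL are placeholders for ones you actually control:

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxied'  # hypothetical spider name
    start_urls = ['https://www.whoscored.com/']

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's built-in HttpProxyMiddleware picks up meta['proxy'].
            yield scrapy.Request(url,
                                 meta={'proxy': 'http://user:pass@proxy.example.com:8080'})

    def parse(self, response):
        self.logger.info('Got %s with status %s', response.url, response.status)

Note that with scrapy-splash it's the Splash container, not Scrapy, that fetches the page, so the proxy has to be handed to Splash instead, e.g. SplashRequest(url, args={'proxy': 'http://proxy.example.com:8080'}).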
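
And for the tunnel check in point 7, one way to verify that the site answers from the AWS machine's IP at all; the host and port are placeholders, and the requests call needs pip install requests[socks]:

import requests

# First run `ssh -D 1080 user@your-aws-host` locally, which opens a
# SOCKS proxy on port 1080 that exits from the AWS machine.
proxies = {
    'http': 'socks5h://127.0.0.1:1080',
    'https': 'socks5h://127.0.0.1:1080',
}

# A 200 here means the site isn't blocking that AWS server's IP.
resp = requests.get('https://www.whoscored.com/', proxies=proxies)
print(resp.status_code)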