Scrapyd corrupting response?
I'm trying to scrape a particular website. The code I use to crawl it is identical to code that successfully scrapes many other sites. However, the resulting response.body looks completely corrupted (snippet below):
����)/A���(��Ե�e�)k�Gl�*�EI�
����:gh��x@����y�F$F�_��%+�\��r1��ND~l""�54بN�:�FA��W
b� �\�F�M��C�o.�7z�Tz|~0��̔HgA�\���[��������:*i�P��Jpdh�v�01]�Ӟ_e�b߇��,�X��E, ��냬�e��Ϣ�5�Ϭ�B<p�A��~�3t3'>N=`
It therefore cannot be parsed. What's really confusing is that if I run scrapy shell on the same URL, everything works fine (the site's charset is utf-8), which leads me to believe the problem is caused by scrapyd.
Any suggestions would be greatly appreciated.
SETTINGS.py

```python
# -*- coding: utf-8 -*-
BOT_NAME = "[name]"

SPIDER_MODULES = ["[name].spiders"]
NEWSPIDER_MODULE = "[name].spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = '[name] (+http://www.yourdomain.com)'

ROBOTSTXT_OBEY = False

CRAWLERA_MAX_CONCURRENT = 50
CONCURRENT_REQUESTS = CRAWLERA_MAX_CONCURRENT
CONCURRENT_REQUESTS_PER_DOMAIN = CRAWLERA_MAX_CONCURRENT
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
DUPEFILTER_DEBUG = True

COOKIES_ENABLED = False  # Disable cookies (enabled by default)

DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
    "accept-encoding": "gzip, deflate, br",
}

DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "KEY"

ITEM_PIPELINES = {
    "[name].pipelines.Export": 400,
}

# sentry dsn
SENTRY_DSN = "Key"

EXTENSIONS = {
    "[name].extensions.SentryLogging": -1,  # Load SentryLogging extension before others
}
```
Thanks to Serhii's suggestion, I found that the problem was caused by "accept-encoding": "gzip, deflate, br": I was requesting compressed responses but never handling (decompressing) them in scrapy. Adding scrapy.downloadermiddlewares.httpcompression, or simply removing the accept-encoding line, solves the problem.
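Either fix can be expressed directly in settings.py. The sketch below shows both options side by side; 590 is the default priority Scrapy assigns to HttpCompressionMiddleware in DOWNLOADER_MIDDLEWARES_BASE. Note also that decoding `br` (brotli) responses requires a brotli package to be installed in addition to the middleware, so dropping `br` from the header is the safer option if that dependency is absent.

```python
# settings.py — two ways to fix the garbled (still-compressed) response body.

# Option 1: ensure HttpCompressionMiddleware is active so Scrapy
# decompresses gzip/deflate bodies before they reach the spider
# (590 is its default priority in Scrapy's base middleware settings).
DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
}

# Option 2: drop the accept-encoding override entirely, so the server
# is never asked for encodings the project does not decode.
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
    # "accept-encoding": "gzip, deflate, br",  # removed
}
```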