Scrapyd corrupting response?

I am trying to scrape a particular website. The code I use to scrape it is the same code I have used to successfully scrape many other sites.

However, the resulting `response.body` looks completely corrupted (snippet below):

����)/A���(��Ե�e�)k�Gl�*�EI�
                             ����:gh��x@����y�F$F�_��%+�\��r1��ND~l""�54بN�:�FA��W
b� �\�F�M��C�o.�7z�Tz|~΢0��̔HgA�\���[��������:*i�P��Jpdh�v�01]�Ӟ_e�b߇��,�X��E, ��냬�e��Ϣ�5�Ϭ�B<p�A��~�3t3'>N=`

and therefore cannot be parsed.
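A quick way to tell "random garbage" apart from a compressed body that was never decompressed is to look at the first bytes of `response.body`: gzip streams always start with the magic bytes `\x1f\x8b`. A minimal sketch (the `sniff_compression` helper is hypothetical, not part of Scrapy):

```python
import gzip


def sniff_compression(body: bytes) -> str:
    """Guess the compression scheme of a raw HTTP body from its magic bytes."""
    if body[:2] == b"\x1f\x8b":
        return "gzip"
    if body[:1] == b"\x78":  # common zlib/deflate header (0x78 0x01/0x9c/0xda)
        return "deflate"
    return "unknown"


# A gzip-compressed body looks like binary noise, but decodes cleanly
# once decompressed:
raw = gzip.compress("<html>ok</html>".encode("utf-8"))
assert sniff_compression(raw) == "gzip"
assert gzip.decompress(raw).decode("utf-8") == "<html>ok</html>"
```

If the snippet above reports `gzip` (or `deflate`) for your response, the body is fine on the wire and the decompression step is what is missing.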

What is really confusing is that if I run `scrapy shell` on the same URL, everything works fine (the site's charset is utf-8), which leads me to believe the problem is caused by scrapyd.

Any suggestions would be greatly appreciated.

settings.py

```python
# -*- coding: utf-8 -*-

BOT_NAME = "[name]"

SPIDER_MODULES = ["[name].spiders"]
NEWSPIDER_MODULE = "[name].spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = '[name] (+http://www.yourdomain.com)'

ROBOTSTXT_OBEY = False

CRAWLERA_MAX_CONCURRENT = 50
CONCURRENT_REQUESTS = CRAWLERA_MAX_CONCURRENT
CONCURRENT_REQUESTS_PER_DOMAIN = CRAWLERA_MAX_CONCURRENT

AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
DUPEFILTER_DEBUG = True

COOKIES_ENABLED = False  # Disable cookies (enabled by default)

DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
    "accept-encoding": "gzip, deflate, br",
}

DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "KEY"

ITEM_PIPELINES = {
    "[name].pipelines.Export": 400,
}
# sentry dsn
SENTRY_DSN = "Key"

EXTENSIONS = {
    "[name].extensions.SentryLogging": -1,  # Load SentryLogging extension before others
}
```

Thanks to Serhii's suggestion, I found that the problem was caused by `"accept-encoding": "gzip, deflate, br"`: I was accepting compressed responses but never decompressing them in scrapy.

Adding `scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware` to the downloader middlewares, or simply removing the `accept-encoding` header, solves the problem.
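For reference, a minimal sketch of the settings fix, keeping the middlewares from the question; `810` is the default priority Scrapy assigns to its built-in `HttpCompressionMiddleware` (adjust if your setup needs a different ordering):

```python
# settings.py (fragment) — explicitly enable response decompression.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
    # Scrapy's built-in middleware that decompresses gzip/deflate bodies:
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
```

The alternative is to drop the `"accept-encoding"` entry from `DEFAULT_REQUEST_HEADERS` entirely, so the server is not invited to send a compressed body in the first place.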