"500 Internal Server Error" when combining Scrapy over Splash with an HTTP proxy
I'm trying to run a Scrapy spider in Docker containers, using Splash (to render JavaScript) and Tor through Privoxy (to provide anonymity). Here is the docker-compose.yml I'm using for this:
version: '3'

services:
  scraper:
    build: ./apk_splash
    # environment:
    #   - http_proxy=http://tor-privoxy:8118
    links:
      - tor-privoxy
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash
where the scraper has the following Dockerfile:
FROM python:alpine
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
RUN pip install scrapy scrapy-splash scrapy-fake-useragent
COPY . /scraper
WORKDIR /scraper
CMD ["scrapy", "crawl", "apkmirror"]
The spider I'm trying to crawl is:
import scrapy
from scrapy_splash import SplashRequest
from apk_splash.items import ApkmirrorItem


class ApkmirrorSpider(scrapy.Spider):
    name = 'apkmirror'
    allowed_domains = ['apkmirror.com']
    start_urls = [
        'http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/',
    ]
    custom_settings = {'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait': 0.5})

    def parse(self, response):
        item = ApkmirrorItem()
        item['url'] = response.url
        item['developer'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){1}[^/]+/$")]/text()').extract_first()
        item['app'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){2}[^/]+/$")]/text()').extract_first()
        item['version'] = response.css('.breadcrumbs').xpath('.//*[re:test(@href, "^/(?:[^/]+/){3}[^/]+/$")]/text()').extract_first()
        yield item
In settings.py I've added the following:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://splash:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
With the environment for the scraper container commented out, the scraper more or less works: I get logs containing the following:
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/ via http://splash:8050/render.html> (referer: None)
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/>
scraper_1 | {'app': 'Androbench (Storage Benchmark)',
scraper_1 | 'developer': 'CSL@SKKU',
scraper_1 | 'url': 'http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/',
scraper_1 | 'version': '5.0'}
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-07-11 13:57:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'downloader/request_bytes': 1508,
scraper_1 | 'downloader/request_count': 3,
scraper_1 | 'downloader/request_method_count/GET': 2,
scraper_1 | 'downloader/request_method_count/POST': 1,
scraper_1 | 'downloader/response_bytes': 190320,
scraper_1 | 'downloader/response_count': 3,
scraper_1 | 'downloader/response_status_count/200': 2,
scraper_1 | 'downloader/response_status_count/404': 1,
scraper_1 | 'finish_reason': 'finished',
scraper_1 | 'finish_time': datetime.datetime(2017, 7, 11, 13, 57, 19, 488874),
scraper_1 | 'item_scraped_count': 1,
scraper_1 | 'log_count/DEBUG': 5,
scraper_1 | 'log_count/INFO': 7,
scraper_1 | 'memusage/max': 49131520,
scraper_1 | 'memusage/startup': 49131520,
scraper_1 | 'response_received_count': 3,
scraper_1 | 'scheduler/dequeued': 2,
scraper_1 | 'scheduler/dequeued/memory': 2,
scraper_1 | 'scheduler/enqueued': 2,
scraper_1 | 'scheduler/enqueued/memory': 2,
scraper_1 | 'splash/render.html/request_count': 1,
scraper_1 | 'splash/render.html/response_count/200': 1,
scraper_1 | 'start_time': datetime.datetime(2017, 7, 11, 13, 57, 13, 788850)}
scraper_1 | 2017-07-11 13:57:19 [scrapy.core.engine] INFO: Spider closed (finished)
apksplashcompose_scraper_1 exited with code 0
However, if I comment the environment lines back in in the docker-compose.yml, I get a 500 Internal Server Error:
scraper_1 | 2017-07-11 14:05:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/ via http://splash:8050/render.html> (failed 3 times): 500 Internal Server Error
scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/ via http://splash:8050/render.html> (referer: None)
scraper_1 | 2017-07-11 14:05:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://www.apkmirror.com/apk/cslskku/androbench-storage-benchmark/androbench-storage-benchmark-5-0-release/androbench-storage-benchmark-5-0-android-apk-download/>: HTTP status code is not handled or not allowed
scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-07-11 14:05:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'downloader/request_bytes': 3898,
scraper_1 | 'downloader/request_count': 7,
scraper_1 | 'downloader/request_method_count/GET': 4,
scraper_1 | 'downloader/request_method_count/POST': 3,
scraper_1 | 'downloader/response_bytes': 6839,
scraper_1 | 'downloader/response_count': 7,
scraper_1 | 'downloader/response_status_count/200': 1,
scraper_1 | 'downloader/response_status_count/500': 6,
scraper_1 | 'finish_reason': 'finished',
scraper_1 | 'finish_time': datetime.datetime(2017, 7, 11, 14, 5, 7, 866713),
scraper_1 | 'httperror/response_ignored_count': 1,
scraper_1 | 'httperror/response_ignored_status_count/500': 1,
scraper_1 | 'log_count/DEBUG': 10,
scraper_1 | 'log_count/INFO': 8,
scraper_1 | 'memusage/max': 49065984,
scraper_1 | 'memusage/startup': 49065984,
scraper_1 | 'response_received_count': 3,
scraper_1 | 'retry/count': 4,
scraper_1 | 'retry/max_reached': 2,
scraper_1 | 'retry/reason_count/500 Internal Server Error': 4,
scraper_1 | 'scheduler/dequeued': 4,
scraper_1 | 'scheduler/dequeued/memory': 4,
scraper_1 | 'scheduler/enqueued': 4,
scraper_1 | 'scheduler/enqueued/memory': 4,
scraper_1 | 'splash/render.html/request_count': 1,
scraper_1 | 'splash/render.html/response_count/500': 3,
scraper_1 | 'start_time': datetime.datetime(2017, 7, 11, 14, 4, 46, 717691)}
scraper_1 | 2017-07-11 14:05:07 [scrapy.core.engine] INFO: Spider closed (finished)
apksplashcompose_scraper_1 exited with code 0
In short, when using Splash to render the JavaScript, I cannot get HttpProxyMiddleware to also route traffic through Tor via Privoxy. Can anyone see what is going wrong here?
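For what it's worth, my understanding is that HttpProxyMiddleware (driven by the http_proxy environment variable) only proxies Scrapy's own HTTP traffic, i.e. the hop from the scraper container to Splash; it does nothing to make Splash route its outgoing requests through Privoxy. A minimal sketch of pushing the proxy down to Splash per request instead, via render.html's documented proxy argument (a hypothetical variant of the spider above, not something I have verified against this setup):

import scrapy
from scrapy_splash import SplashRequest


class ProxiedApkmirrorSpider(scrapy.Spider):
    # Hypothetical spider name and placeholder URL, for illustration only.
    name = 'apkmirror_proxied'
    start_urls = ['http://www.apkmirror.com/']

    def start_requests(self):
        for url in self.start_urls:
            # 'proxy' is passed through to Splash's render.html endpoint,
            # so the rendering traffic itself goes through Privoxy/Tor,
            # not just the scraper -> Splash hop.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5, 'proxy': 'http://tor-privoxy:8118'},
            )

    def parse(self, response):
        self.logger.info('Rendered %s', response.url)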
UPDATE
Following Paul's comment, I tried adjusting the splash service as follows:
splash:
  image: scrapinghub/splash
  volumes:
    - ./splash/proxy-profiles:/etc/splash/proxy-profiles
and I added a 'splash' directory to the main directory, structured as follows:
.
├── apk_splash
├── docker-compose.yml
└── splash
└── proxy-profiles
└── proxy.ini
where proxy.ini reads:
[proxy]
host=tor-privoxy
port=8118
As I understand it, this should cause the proxy to always be used (i.e. whitelist defaults to ".*" and there is no blacklist).
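For reference, a profile spelling those defaults out explicitly would look roughly like this (a sketch based on the Splash proxy-profiles documentation; the [rules] section is optional):

[proxy]
; the linked Privoxy container and its listening port
host=tor-privoxy
port=8118

[rules]
; whitelist defaults to ".*" (proxy every request); there is no default blacklist
whitelist=.*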
However, if I docker-compose build and docker-compose up again, I still get HTTP 500 errors. So the question remains how to resolve these.
(Incidentally, this issue looks similar to https://github.com/scrapy-plugins/scrapy-splash/issues/117; however, I'm not using Crawlera, so I'm not sure how to adapt that answer.)
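One debugging step that might narrow this down (my suggestion, not something I have run here): request render.html directly from the scraper container and inspect the body of the 500 response, since Splash returns a JSON description of the error. The scraper's Dockerfile installs curl and bash, so this should work:

~$ docker exec -it apksplashcompose_scraper_1 /bin/bash
bash-4.3# curl 'http://splash:8050/render.html?url=http://www.apkmirror.com/&wait=0.5'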
UPDATE 2
Following Paul's second comment, I checked that tor-privoxy resolves inside the scraper container (while it was still running) by doing this:
~$ docker ps -l
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
04909e6ef5cb apksplashcompose_scraper "scrapy crawl apkm..." 2 hours ago Up 8 seconds apksplashcompose_scraper_1
~$ docker exec -it $(docker ps -lq) /bin/bash
bash-4.3# python
Python 3.6.1 (default, Jun 19 2017, 23:58:41)
[GCC 5.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostbyname('tor-privoxy')
'172.22.0.2'
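Name resolution alone doesn't prove that Privoxy is accepting connections on port 8118; a quick follow-up in the same interpreter session (my addition) would be:

>>> import socket
>>> s = socket.create_connection(('tor-privoxy', 8118), timeout=5)
>>> s.close()  # no exception means the proxy port is reachable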
As for how I'm running Splash, it is via linked containers, similar to what is described in https://splash.readthedocs.io/en/stable/install.html#docker-folder-sharing. I've verified that /etc/splash/proxy-profiles/proxy.ini exists in the container:
~$ docker exec -it apksplashcompose_splash_1 /bin/bash
root@b091fbef4c78:/# cd /etc/splash/proxy-profiles
root@b091fbef4c78:/etc/splash/proxy-profiles# ls
proxy.ini
root@b091fbef4c78:/etc/splash/proxy-profiles# cat proxy.ini
[proxy]
host=tor-privoxy
port=8118
I will try Aquarium, but the question remains: why doesn't the current setup work?
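A related check (assuming curl is available in the Splash image, or can be installed) would be to confirm from inside the Splash container that the proxy works end-to-end, independently of Scrapy and of the proxy profile:

~$ docker exec -it apksplashcompose_splash_1 /bin/bash
root@b091fbef4c78:/# curl -x http://tor-privoxy:8118 -o /dev/null -s -w '%{http_code}\n' http://www.apkmirror.com/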
UPDATE 3
Following the structure of the Aquarium project, as suggested by paul trmbrth, I found that it is essential to name the .ini file default.ini, not proxy.ini (otherwise it doesn't get 'picked up' automatically). I managed to get the scraper to work this way (cf. my self-answer).
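So, assuming the same directory layout as above, the only change needed for the profile to be picked up is the file name:

.
├── apk_splash
├── docker-compose.yml
└── splash
    └── proxy-profiles
        └── default.ini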