502 error when scraping LinkedIn using scrapy with splash
I am trying to scrape Netflix's LinkedIn company page using Scrapy with Splash. It works fine when I use the scrapy shell, but when I run the spider it gives a 502 error.
The error:
2017-01-06 16:06:45 [scrapy.core.engine] INFO: Spider opened
2017-01-06 16:06:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-06 16:06:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.linkedin.com/company/netflix via http://localhost:8050/render.html> (failed 1 times): 502 Bad Gateway
2017-01-06 16:06:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.linkedin.com/company/netflix via http://localhost:8050/render.html> (failed 2 times): 502 Bad Gateway
2017-01-06 16:07:05 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.linkedin.com/company/netflix via http://localhost:8050/render.html> (failed 3 times): 502 Bad Gateway
2017-01-06 16:07:05 [scrapy.core.engine] DEBUG: Crawled (502) <GET https://www.linkedin.com/company/netflix via http://localhost:8050/render.html> (referer: None)
2017-01-06 16:07:05 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <502 https://www.linkedin.com/company/netflix>: HTTP status code is not handled or not allowed
2017-01-06 16:07:05 [scrapy.core.engine] INFO: Closing spider (finished)
In the Splash terminal:
2017-01-06 10:36:52.186410 [render] [139764812670456] loadFinished: RenderErrorInfo(type='HTTP', code=999, text='Request denied', url='https://www.linkedin.com/company/netflix')
2017-01-06 10:36:52.205523 [events] {"fds": 18, "qsize": 0, "args": {"url": "https://www.linkedin.com/company/netflix", "headers": {"User-Agent": "Scrapy/1.3.0 (+http://scrapy.org)", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en"}, "uid": 139764812670456, "wait": 0.5}, "rendertime": 6.674675464630127, "timestamp": 1483699012, "user-agent": "Scrapy/1.3.0 (+http://scrapy.org)", "maxrss": 87956, "error": {"info": {"url": "https://www.linkedin.com/company/netflix", "code": 999, "type": "HTTP", "text": "Request denied"}, "error": 502, "description": "Error rendering page", "type": "RenderError"}, "active": 0, "load": [0.51, 0.67, 0.8], "status_code": 502, "client_ip": "172.17.0.1", "method": "POST", "_id": 139764812670456, "path": "/render.html"}
2017-01-06 10:36:52.206259 [-] "172.17.0.1" - - [06/Jan/2017:10:36:51 +0000] "POST /render.html HTTP/1.1" 502 192 "-" "Scrapy/1.3.0 (+http://scrapy.org)"
Spider code:
import scrapy
from scrapy_splash import SplashRequest
from linkedin.items import LinkedinItem


class LinkedinScrapy(scrapy.Spider):
    name = 'linkedin_spider'  # spider name
    allowed_domains = ['linkedin.com']
    start_urls = ['https://www.linkedin.com/company/netflix']

    def start_requests(self):
        for url in self.start_urls:
            # render the page through Splash and wait 0.5 s before returning the HTML
            yield SplashRequest(url, self.parse,
                                endpoint='render.html', args={'wait': 0.5})

    def parse(self, response):
        item = LinkedinItem()
        item['name'] = response.xpath(
            '//*[@id="stream-promo-top-bar"]/div[2]/div[1]/div[1]/div/h1/span/text()').extract_first()
        item['followers'] = response.xpath(
            '//*[@id="biz-follow-mod"]/div/div/div/p/text()').extract_first().split()[0]
        item['description'] = response.xpath(
            '//*[@id="stream-about-section"]/div[2]/div[1]/div/p/text()').extract_first()
        yield item
This is most likely because LinkedIn is denying access to your request, since it is sent with the default Scrapy user agent string:
"User-Agent": "Scrapy/1.3.0 (+http://scrapy.org)"
You should change the user agent in your spider to something else; see Mozilla's documentation on the User-Agent header for that.
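A minimal sketch of one way to do that with scrapy-splash, assuming the same spider and the Splash instance at localhost:8050 from the logs above; the Chrome user agent string below is only an example value, not something LinkedIn specifically requires:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
                # headers= is passed through to the underlying scrapy.Request;
                # the Splash log above shows these headers being forwarded to
                # render.html, so this is the User-Agent the target site sees
                headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                                       'Chrome/55.0.2883.87 Safari/537.36'})

Alternatively, set the project-wide USER_AGENT option in settings.py to the same string so every request uses it. Note that the 999 "Request denied" response is LinkedIn's anti-bot measure, so a browser-like user agent alone may still not be enough.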