exited: scrapy (exit status 0; not expected)

I'm trying to run a bash script that launches several spiders inside my Docker container. My supervisor.conf, placed in "/etc/supervisor/conf.d/", looks like this:

[program:scrapy]
command=/tmp/start_spider.sh
autorestart=false
startretries=0
stderr_logfile=/tmp/start_spider.err.log
stdout_logfile=/tmp/start_spider.out.log

But supervisor returns this error:

2015-08-21 10:50:30,466 CRIT Supervisor running as root (no user in config file)
2015-08-21 10:50:30,466 WARN Included extra file "/etc/supervisor/conf.d/tor.conf" during parsing
2015-08-21 10:50:30,478 INFO RPC interface 'supervisor' initialized
2015-08-21 10:50:30,478 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2015-08-21 10:50:30,478 INFO supervisord started with pid 5
2015-08-21 10:50:31,481 INFO spawned: 'scrapy' with pid 8
2015-08-21 10:50:31,555 INFO exited: scrapy (exit status 0; not expected)
2015-08-21 10:50:32,557 INFO gave up: scrapy entered FATAL state, too many start retries too quickly

My program stops running. But if I run the script manually, it works just fine...

How can I fix this? Any ideas?

Here is my code:

start_spider.sh

#!/bin/bash

# letters to crawl
parseLetter=('a' 'b')


# change to the scrapy project directory ($path must be set in the environment)
cd "$path/scrapy/scrapyTodo/scrapyTodo" || exit 1

tLen=${#parseLetter[@]}
for (( i=0; i<tLen; i++ )); do
    # launch one spider per letter, in the background
    scrapy crawl root -a alpha="${parseLetter[$i]}" &
done
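
For reference, running it manually looks like this (the $path value below is only an example; it has to point at the directory above scrapy/):

export path=/home/user   # example value, adjust to your setup
/tmp/start_spider.sh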

And here is my scrapy code:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-

from datetime import datetime
import re

from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch
from scrapy import signals
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher
from urlparse import urljoin

from tools import ElasticAction, runlog, sendEmail

class studentCrawler(BaseSpider):
    # crawl start time
    started_on = datetime.now()

    name = "root"


    DOWNLOAD_DELAY = 0

    allowed_domains = ['website.com']

    ES_Index = "website"
    ES_Type = "root"
    ES_Ip = "127.0.0.1"

    child_type = "level1"

    handle_httpstatus_list = [404, 302, 503, 999, 200] #add any other code you need

    es = ElasticAction(ES_Index, ES_Type, ES_Ip)

    # Init
    def __init__(self, alpha=''):
        # build the start URL from the letter passed via -a alpha=...
        base_domain = 'https://www.website.com/directory/student-' + str(alpha) + "/"

        self.start_urls = [base_domain]
        super(studentCrawler, self).__init__()


    def is_empty(self, any_structure):
        """
        Return 1 if the given data is non-empty, 0 otherwise.
        :arg any_structure: any data
        """
        return 1 if any_structure else 0

    def parse(self, response):
        """
        Main callback: record the page status in Elasticsearch and,
        on success, extract the directory links from the page.
        :param response:
        :return:
        """

        if response.status in (404, 503, 999):
            self.es.insertIntoES(response.url, "False")

        if response.status == 200:
            # Selector
            sel = Selector(response)

            self.es.insertIntoES(response.url, "True")
            body = self.getAllTheUrl(u''.join(sel.xpath(".//*[@id='seo-dir']/div/div[3]").extract()).strip(), response.url)


    def getAllTheUrl(self, data, parent_id):
        """Index every link found in the extracted HTML fragment."""
        soup = BeautifulSoup(data, 'html.parser')
        for a in soup.find_all('a', href=True):
            self.es.insertChildAndParent(self.child_type, str(a['href']), "False", parent_id)

I found that BeautifulSoup does not work when the spiders are launched by supervisor....
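
My guess is that supervisor starts the script with a minimal environment (PATH, PYTHONPATH and so on differ from an interactive shell), so imports like BeautifulSoup can resolve differently. A quick way to check is to dump the environment the script actually sees, for example by adding this near the top of start_spider.sh:

# temporary debugging: log the environment supervisor provides
env | sort > /tmp/spider_env.log
which scrapy >> /tmp/spider_env.log 2>&1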

I found the solution to my problem. In supervisor.conf, change:

[program:scrapy]
command=/tmp/start_spider.sh
autorestart=false
startretries=0

to:

[program:scrapy]
command=/bin/bash -c "exec /tmp/start_spider.sh > /dev/null 2>&1 -DFOREGROUND"
autostart=true
autorestart=false
startretries=0
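
The likely explanation for the original failure: every scrapy crawl is backgrounded with &, so start_spider.sh exits a few milliseconds after being spawned, and supervisor treats any exit before its startsecs window (1 second by default) as a failed start regardless of the exit code; hence "exit status 0; not expected" and the FATAL state. After editing the config, supervisor also has to reload it:

supervisorctl reread
supervisorctl update
supervisorctl status scrapy

An alternative sketch (untested here) is to keep the script itself in the foreground by waiting on the background jobs, so supervisor sees a long-running process:

#!/bin/bash

parseLetter=('a' 'b')
cd "$path/scrapy/scrapyTodo/scrapyTodo" || exit 1

for letter in "${parseLetter[@]}"; do
    scrapy crawl root -a alpha="$letter" &
done

# keep the script alive until every spider has finished,
# so supervisor does not see an immediate exit
wait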