Twisted Reactor not restarting in Scrapy

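I am trying to run a Scrapy spider through a Telegram bot, using the python-telegram-bot API wrapper. With the code below I can run the spider and forward the scraped results to the bot, but only once after I start the script. When I try to run the spider again through the bot (with a Telegram bot command), I get the error twisted.internet.error.ReactorNotRestartable.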

from twisted.internet import reactor
from scrapy import cmdline
from telegram.ext import Updater, CommandHandler, MessageHandler, Filters, RegexHandler
import logging
import os
import ConfigParser
import json
import textwrap
from MIS.spiders.moodle_spider import MySpider
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner, CrawlerProcess
from scrapy.utils.log import configure_logging


# Read settings from config file
config = ConfigParser.RawConfigParser()
config.read('./spiders/creds.ini')
TOKEN = config.get('BOT', 'TOKEN')
#APP_NAME = config.get('BOT', 'APP_NAME')
#PORT = int(os.environ.get('PORT', '5000'))
updater = Updater(TOKEN)

# Setting Webhook
#updater.start_webhook(listen="0.0.0.0",
#                      port=PORT,
#                      url_path=TOKEN)
#updater.bot.setWebhook(APP_NAME + TOKEN)

logging.basicConfig(format='%(asctime)s -# %(name)s - %(levelname)s - %(message)s',level=logging.INFO)

dispatcher = updater.dispatcher

# Real stuff

def doesntRun(bot, update):
    #process = CrawlerProcess(get_project_settings())
    #process.crawl(MySpider)
    #process.start()
    ############

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner({
        'FEED_FORMAT' : 'json',
        'FEED_URI' : 'output.json'
        })

    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run(installSignalHandlers=0) # the script will block here until the crawling is finished
    #reactor.stop()

    with open("./output.json", 'r') as file:
        contents = file.read()
        a_r = json.loads(contents)
        AM = a_r[0]['AM']
        ...
        ...

        message_template = textwrap.dedent("""
                AM: {AM}
                ...
                """)
        messageContent = message_template.format(AM=AM, ...)
        #print messageContent
        bot.sendMessage(chat_id=update.message.chat_id, text=messageContent)
        #reactor.stop()


# Handlers
test_handler = CommandHandler('doesntRun', doesntRun)

# Dispatchers
dispatcher.add_handler(test_handler)

updater.start_polling()
updater.idle()

I am using the code from the documentation: https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

The code there is:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

OK, I finally solved my problem.

The python-telegram-bot API wrapper offers an easy way to restart the bot.

I simply added these lines:

# needs "import sys" and "import time" at the top of the script (os is already imported)
time.sleep(0.2)
os.execl(sys.executable, sys.executable, *sys.argv)

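I put them at the end of the doesntRun() function. Now, whenever I call the function through the bot, it scrapes the page, stores the results, forwards them, and then restarts itself. This lets me run the spider as many times as I like.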
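The restart works because os.execl replaces the running process with a new Python interpreter, so the script starts over with a reactor that has never been run. A minimal sketch of those two lines wrapped in a helper follows; the names build_restart_args and restart_process are hypothetical and chosen only for illustration, and the 0.2 second delay is simply the value used above.

```python
import os
import sys
import time


def build_restart_args(argv=None):
    """Return the (path, args) pair that os.execl needs to re-run this script.

    Hypothetical helper: split out only so the arguments can be inspected
    without actually replacing the current process.
    """
    argv = sys.argv if argv is None else argv
    return sys.executable, [sys.executable] + list(argv)


def restart_process(delay=0.2):
    """Replace the current process with a fresh run of the same script."""
    time.sleep(delay)  # short pause before restarting, as in the lines above
    path, args = build_restart_args()
    # os.execl does not return on success: the new interpreter takes over
    # this process, so no code after this call runs in the old one.
    os.execl(path, *args)
```

With this in place, the last statement of doesntRun() would be a call to restart_process(). Anything that must survive the restart has to be written to disk first, because all in-memory state of the bot is lost when the process is replaced.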