How can I put all the results of 2 spiders in one XML with scrapy?
I have made 2 spiders with Scrapy, and I need to run them from one script and put all the results into a single XML file.
The page below shows some ways to combine 2 spiders into one script, but I can't get the results into a single XML file:
http://doc.scrapy.org/en/latest/topics/practices.html
Is there any way to start 1 script with 2 spiders and collect all the results into one file?
Create a Python script called script.py in your Scrapy project and add the following code. Assuming the spider files are named spider_one.py and spider_two.py, and the spiders themselves are SpiderOne and SpiderTwo respectively, add the following to your script.py:
from spiders.spider_one import SpiderOne
from spiders.spider_two import SpiderTwo

# scrapy api
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

file = "your_file.json"  # your results
TO_CRAWL = [SpiderOne, SpiderTwo]

# list of crawlers that are running
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    settings.set("USER_AGENT", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
    settings.set("FEED_FORMAT", 'json')
    settings.set("FEED_URI", file)
    # settings.set("ITEM_PIPELINES", {'pipelines.CustomPipeline': 300})
    settings.set("DOWNLOAD_DELAY", 1)
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process so always keep as the last statement
reactor.run()
The example works with JSON, but it can be adapted to XML in the same way.
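For reference, the same idea can also be sketched with Scrapy's CrawlerProcess helper from the practices page linked in the question, with the feed settings switched to XML. This is only a minimal sketch, not the exact script above: the SpiderOne/SpiderTwo imports mirror the ones used earlier, your_file.xml is an illustrative output path, and because both crawls write to the same feed URI the resulting file may still need a merge or clean-up step to end up as a single well-formed XML document.

# Minimal sketch: run both spiders in one process and export to one XML feed.
# Assumes the same spiders package layout as the script above; the output
# path your_file.xml is illustrative.
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

from spiders.spider_one import SpiderOne
from spiders.spider_two import SpiderTwo

settings = Settings()
settings.set("FEED_FORMAT", "xml")         # built-in XML item exporter
settings.set("FEED_URI", "your_file.xml")  # (newer Scrapy uses the FEEDS setting instead)

process = CrawlerProcess(settings)
process.crawl(SpiderOne)  # schedule both spiders in the same process
process.crawl(SpiderTwo)
process.start()           # blocks until both crawls are finished

CrawlerProcess starts and stops the Twisted reactor itself, which is why this sketch has no explicit reactor.run() at the end.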