How to set default settings for running scrapy as a python script?

I want to run scrapy as a python script, but I don't know how to set the settings correctly, or how to supply them at all. I'm not sure whether this is a settings problem, but I assume it is.

My configuration:

I followed the advice at https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script and got it running. I have some questions about the following advice:

If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.

那么 "inside a Scrapy project" 是什么意思?当然,我必须导入库并安装依赖项,但我想避免使用 scrapy crawl xyz.

开始抓取过程

这是myScrapy.py

的代码
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse


#Initialization of directories
projectDir = os.path.dirname(os.path.realpath(__file__))  # __file__ without quotes resolves the script's own path
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

# Strip the given URL down to the host + TLD
if urlToScan.startswith("https://"):
    urlToScanNoProt = urlToScan.replace("https://", "")
    print("used protocol: https")
elif urlToScan.startswith("http://"):  # elif: a plain "http" substring check would also match https URLs
    urlToScanNoProt = urlToScan.replace("http://", "")
    print("used protocol: http")

class myItem(Item):
    url = Field()

class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan,]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True), )

    def generateDirs(self):
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        # save the page body to the per-host output directory
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        # yield one item per link found on the page
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = link.url
            yield item

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start() # the script will block here until the crawling is finished

Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)? I think getting the settings is the problem, because they are only set up as in a "normal" scrapy project (where you have to run scrapy crawl xyz), and the output says 2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}.

I hope you can understand my question (English is not my native language... ;)). Thanks in advance!

When running the crawl from a script (and not with scrapy crawl), one of the options is indeed to use CrawlerProcess.

So what is meant by "inside a Scrapy project"?

It means that you run your script from the root directory of a scrapy project created with scrapy startproject, i.e. the directory where you have the scrapy.cfg file with its [settings] section, among other things.
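
For illustration, a scrapy.cfg generated by scrapy startproject looks roughly like this (myproject is a placeholder for your actual project name):

# Automatically created by: scrapy startproject
[settings]
default = myproject.settings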

Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)?

Read the documentation on scrapy.crawler.CrawlerProcess.crawl() for details:

Parameters:
crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it

I don't know that part of the framework off-hand, but I suspect that with only a spider name (and I believe you mean process.crawl("linkspider")), and outside of a scrapy project, scrapy does not know where to look for the spider (it has no hint). Hence, to tell scrapy which spider to run, you might as well give the class directly (and not an instance of a spider class).
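
A short sketch of the difference, reusing the mySpider class and the "linkspider" name from your question:

process.crawl(mySpider)        # a spider class works anywhere
process.crawl("linkspider")    # a spider name can only be resolved inside a scrapy project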

get_project_settings() is a helper, but essentially CrawlerProcess needs to be initialized with a Settings object (see https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess).
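
For example, a minimal sketch of building such a Settings object by hand (the user agent string is only a placeholder):

from scrapy.settings import Settings

settings = Settings()
settings.set('USER_AGENT', 'myBot (+https://example.com)')  # placeholder value
process = CrawlerProcess(settings)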

In fact, it also accepts a settings dict (which is internally converted into a Settings instance), as shown in the example you linked to:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

So, depending on which settings you need to override compared to the scrapy defaults, you would do something like:

process = CrawlerProcess({
    'SOME_SETTING_KEY': somevalue,
    'SOME_OTHERSETTING_KEY': someothervalue,
    ...
})
process.crawl(mySpider)
...
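
A filled-in version of that sketch could look like this (the keys and values below are only common examples, not a prescription):

process = CrawlerProcess({
    'USER_AGENT': 'myScrapy (+https://example.com)',  # placeholder identification string
    'DOWNLOAD_DELAY': 1.0,  # throttle: wait 1 second between requests
    'LOG_LEVEL': 'INFO',
})
process.crawl(mySpider)
process.start()  # blocks until the crawl is finished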