Scrapy throws an error when run using CrawlerProcess

I wrote a script in Python that uses scrapy to collect the names of different posts and their links from a website. When I execute the script from the command line, it works flawlessly. Now, my intention is to run the script using CrawlerProcess(). I looked for similar problems in different places, but I could not find any direct solution or anything close to one. However, when I try to run it as is, I get the following error:

from Whosebug.items import WhosebugItem
ModuleNotFoundError: No module named 'Whosebug'

This is my script so far (Whosebugspider.py):

from scrapy.crawler import CrawlerProcess
from Whosebug.items import WhosebugItem
from scrapy import Selector
import scrapy

class Whosebugspider(scrapy.Spider):
    name = 'Whosebug'
    start_urls = ['https://whosebug.com/questions/tagged/web-scraping']

    def parse(self,response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = WhosebugItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(Whosebugspider)
    c.start()

items.py includes:

import scrapy

class WhosebugItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

Here is the project tree: Click to see the hierarchy

I know I can succeed this way, but I am only interested in accomplishing the task the way I tried above:

def parse(self,response):
    sel = Selector(response)  # this line was missing; sel is used below
    for link in sel.xpath("//*[@class='question-hyperlink']"):
        name = link.xpath('.//text()').extract_first()
        url = link.xpath('.//@href').extract_first()
        yield {"Name":name,"Link":url}

This is a Python path problem. The simplest fix is to set the Python path explicitly when invoking the script, i.e. run it from the directory containing scrapy.cfg (and, more importantly, the Whosebug module):

PYTHONPATH=. python3 Whosebug/spiders/Whosebugspider.py

This sets the Python path to include the current directory (.).
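To see why that works, the sketch below reproduces the mechanism in pure Python: it builds a throwaway package (`demo_pkg`, a made-up name, not part of the original project) in a temporary directory, shows that importing it fails while its parent directory is off the search path, then appends that directory to sys.path, which is exactly what `PYTHONPATH=.` does for the current directory.

```python
import os
import sys
import tempfile

# Build a minimal package in a temp dir: <tmp>/demo_pkg/__init__.py
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "demo_pkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# The parent directory is not on sys.path yet, so the import fails,
# just like 'No module named Whosebug' in the question.
try:
    import demo_pkg
except ModuleNotFoundError:
    print("ModuleNotFoundError, as expected")

# Appending the parent dir to sys.path is what PYTHONPATH=<dir> does.
sys.path.append(tmp)
import demo_pkg
print(demo_pkg.VALUE)  # 42
```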

For alternatives, see https://www.daveoncode.com/2017/03/07/how-to-solve-python-modulenotfound-no-module-named-import-error/

Although @Dan-Dev pointed me in the right direction, I decided to provide a complete solution that worked flawlessly for me.

Nothing changed anywhere other than what I pasted below:

import sys
#The following line (which leads to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\Whosebug')
from scrapy.crawler import CrawlerProcess
from Whosebug.items import WhosebugItem
from scrapy import Selector
import scrapy


class Whosebugspider(scrapy.Spider):
    name = 'Whosebug'
    start_urls = ['https://whosebug.com/questions/tagged/web-scraping']

    def parse(self,response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = WhosebugItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(Whosebugspider)
    c.start()

Once again, including the following in the script solved the problem:

import sys
#The following line (which leads to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\Whosebug')
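Hard-coding `C:\Users\WCS\Desktop\Whosebug` ties the script to one machine. A more portable variant (my own suggestion, not part of the original answer) derives the project root from the script's own location: since Whosebugspider.py lives in `<project>/Whosebug/spiders/`, the directory containing scrapy.cfg is two levels above the file.

```python
import os
import sys

# Walk two directory levels up from this file:
# .../Whosebug/spiders/Whosebugspider.py -> .../  (dir holding scrapy.cfg)
project_root = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(project_root)
print(project_root)
```

With this in place the `from Whosebug.items import WhosebugItem` import resolves regardless of which directory the script is launched from.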