Scrapy throws an error when run using crawlerprocess
I've written a script in Python using Scrapy to collect the names of different posts and their links from a website. When I execute the script from the command line, it works flawlessly. Now my intention is to run the script using CrawlerProcess(). I've looked for similar problems in different places, but nowhere could I find a direct solution or anything close to it. However, when I try to run it as it is, I get the following error:
from Whosebug.items import WhosebugItem
ModuleNotFoundError: No module named 'Whosebug'
This is my script so far (Whosebugspider.py):
from scrapy.crawler import CrawlerProcess
from Whosebug.items import WhosebugItem
from scrapy import Selector
import scrapy

class Whosebugspider(scrapy.Spider):
    name = 'Whosebug'
    start_urls = ['https://whosebug.com/questions/tagged/web-scraping']

    def parse(self, response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = WhosebugItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(Whosebugspider)
    c.start()
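For reference, the working command-line run mentioned above would be issued from the project root (the folder containing scrapy.cfg); a typical invocation looks roughly like the following, where the output file name is only an illustration:

scrapy crawl Whosebug -o output.json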
items.py includes:
import scrapy

class WhosebugItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
Here is the project tree: [screenshot of the project hierarchy]
I know I can get it to work this way, but I am only interested in accomplishing the task the way I tried above:
def parse(self, response):
    sel = Selector(response)  # Selector must be defined, as in the full script above
    for link in sel.xpath("//*[@class='question-hyperlink']"):
        name = link.xpath('.//text()').extract_first()
        url = link.xpath('.//@href').extract_first()
        yield {"Name": name, "Link": url}
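As a side note (not part of the original question), recent Scrapy versions let you call response.xpath() directly, so the explicit Selector is optional; an equivalent callback would be roughly:

def parse(self, response):
    # response.xpath() works on the response itself in current Scrapy releases
    for link in response.xpath("//*[@class='question-hyperlink']"):
        yield {
            "Name": link.xpath('.//text()').extract_first(),
            "Link": link.xpath('.//@href').extract_first(),
        }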
This is a Python path problem.
The simplest fix is to set the Python path explicitly when invoking the script, i.e. run it from the directory containing scrapy.cfg (and, more importantly, the Whosebug module):
PYTHONPATH=. python3 Whosebug/spiders/Whosebugspider.py
This sets the Python path to include the current directory (.).
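Since the self-answer below uses a Windows path, the Windows (cmd.exe) equivalent of the same idea would be roughly, assuming the shell is opened in the project root:

set PYTHONPATH=.
python Whosebug\spiders\Whosebugspider.py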
Although @Dan-Dev pointed me in the right direction, I decided to provide a complete solution that works flawlessly for me.
Nothing changed anywhere except what I am pasting below:
import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\Whosebug')

from scrapy.crawler import CrawlerProcess
from Whosebug.items import WhosebugItem
from scrapy import Selector
import scrapy

class Whosebugspider(scrapy.Spider):
    name = 'Whosebug'
    start_urls = ['https://whosebug.com/questions/tagged/web-scraping']

    def parse(self, response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = WhosebugItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(Whosebugspider)
    c.start()
Once again, including the following in the script solved the problem:
import sys
#The following line (which leads to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\Whosebug')
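As a small refinement (not part of the original answer), the hardcoded path can be derived from the script's own location instead, assuming the standard Scrapy layout where Whosebugspider.py sits in <project root>\Whosebug\spiders\:

import sys
from pathlib import Path

# spiders/ -> Whosebug package -> project root (the folder containing scrapy.cfg)
project_root = Path(__file__).resolve().parents[2]
sys.path.append(str(project_root))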