How to pass website address to SpiderClass from another python script
I need to pass a login URL from one class to the spider class and run a web scrape on it.
import quotes as q
import scrapy
from scrapy.crawler import CrawlerProcess

class ValidateURL:
    def checkURL(self,urls):
        try:
            if(urls):
                for key, value in urls.items():
                    if value['login_details']:
                        self.runScrap(value)
        except:
            return False

    def runScrap(self,data):
        if data:
            process = CrawlerProcess()
            # here I'm passing a URL (mail.google.com)
            process.crawl(q.QuotesSpider, passed_url=data['url'])
            process.start()

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
import sys
import logging
from bs4 import BeautifulSoup
# import scrapy
# from scrapy.crawler import CrawlerProcess

logging.basicConfig(filename='app.log',level=logging.INFO)

class QuotesSpider(Spider):
    name = 'quotes'
    # I need to update this with passed variable
    start_urls = ('https://quotes.toscrape.com/login',)

    def parse(self, response):
        pass

    def scrape_pages(self, response):
        pass

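My code is fairly self-explanatory: I need to update the spider's class-level variable with the passed argument. How can I do this? I tried using self.passed_url, but it is only accessible inside methods, so start_urls does not get updated.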
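You need to name the argument you pass so that it matches the spider's start_urls attribute.
According to the docs, if you don't override the spider's __init__ method, every argument passed to the spider class is mapped to a spider attribute. So, to override the start_urls attribute, you need to pass an argument with exactly that name.
Like this: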
def runScrap(self,data):
    if data:
        process = CrawlerProcess()
        process.crawl(q.QuotesSpider, start_urls=[data['url']])
        process.start()
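
To see why the argument name matters, here is a minimal sketch of the mapping described above. It does not import Scrapy: MiniSpider is a hypothetical, simplified stand-in for scrapy.Spider that only copies keyword arguments onto the instance, and https://example.com/login is a placeholder URL, not one from the question.

```python
class MiniSpider:
    """Hypothetical stand-in for scrapy.Spider, reduced to argument handling."""
    name = None
    start_urls = ()

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Each keyword argument becomes an instance attribute; an instance
        # attribute shadows a class attribute that has the same name.
        self.__dict__.update(kwargs)


class QuotesSpider(MiniSpider):
    name = 'quotes'
    start_urls = ('https://quotes.toscrape.com/login',)


# No arguments: the class-level default is used.
default_spider = QuotesSpider()

# Argument named start_urls: the instance value replaces the default.
custom_spider = QuotesSpider(start_urls=['https://example.com/login'])

# Argument named passed_url: stored on the instance, start_urls untouched.
other_spider = QuotesSpider(passed_url='https://example.com/login')

print(default_spider.start_urls)
print(custom_spider.start_urls)
print(other_spider.passed_url, other_spider.start_urls)
```

Under this sketch, passed_url only ever exists on the instance, which is consistent with it being reachable as self.passed_url inside methods while the class-level start_urls stays at its default.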
Hope this helps.