How to pass information from one method to another in Scrapy
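I am scraping data from a website where I need to collect data from individual candidate profiles. The catch is that part of the data has to be extracted from the profile snippet, and the rest has to be extracted after opening the profile.
The fields to extract from the snippet are:
1. Work Authorization
2. Candidate Name
3. Image ID
The rest of the data can be extracted once the profile is open.
Problem:
I wrote a spider and want to pass the data for the fields above from one method to another. Currently, when I run my spider, the data for these three fields is repeated for every candidate profile on a given page. I am new to both web scraping and Python. Could you help me?
I am attaching my spider code and items.py file for reference: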
import scrapy
from urllib.parse import urljoin
from hbs_candidates.items import HbsCandidatesItem

domain = 'https://www.myvisajobs.com'
url = 'https://www.myvisajobs.com/CV/Search.aspx?DG=Bachelor&P=1'
page_scraped = 2
classes = ['HighLight: ', 'Membership: ', 'Honor: ', 'Skills: ', 'Degree: ', 'Career Level: ', 'Certification: ', 'Occupation: ', 'Reference: ', 'Target Locations: ', 'Career Title: ', 'Goal: ', 'Target Title:']


class InfoSpider(scrapy.Spider):
    name = 'inform'
    start_urls = [url]
    # page_no = 1

    def parse(self, response):
        wa_temp = []
        items = HbsCandidatesItem()
        tables = response.xpath("""//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_divContent"]/center/table/tr""")
        names_temp = tables.css('b a::text').extract()
        images_temp = [domain + x for x in response.css('img::attr(src)').extract()[1:]]
        for i in range(len(tables)):
            wa = str(tables.xpath("""//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_divContent"]/center/table/tr[3]/td[2]/text()[6]""").extract()).split('Work Authorization: ')[1]
            if wa is not None:
                temp_wa = wa
            else:
                temp_wa = 'N/A'
            wa_temp.append(temp_wa)
        my_list = response.css('b a::attr(href)').extract()
        for i in range(len(my_list)):
            url_final = urljoin(url, my_list[i])
            temp_url = response.urljoin(url_final)
            items['Candidate Name'] = names_temp[i]
            items['Image ID'] = images_temp[i]
            items['Work Authorization'] = wa_temp[i]
            request = scrapy.Request(temp_url, callback=self.parse_can_contents)
            request.cb_kwargs['items'] = items
            yield request

    def parse_can_contents(self, response, items):
        # code to scrape data from the profile page and assign values to items
        # -----------
        # -------------
        # I want to access the values passed from the parse method here
        yield items
items.py code:
from scrapy.item import Item, Field


class HbsCandidatesItem(Item):
    def __setitem__(self, key, value):
        # Create the field on first use so arbitrary keys can be assigned.
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value
I hope this is clear. Please ask if anything in the question is ambiguous. Thanks!
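The item should be created inside the for loop (items = HbsCandidatesItem()). Otherwise every request carries the same item object, and later loop iterations overwrite the values that earlier requests were supposed to carry: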
for i in range(len(my_list)):
    url_final = urljoin(url, my_list[i])
    temp_url = response.urljoin(url_final)
    items = HbsCandidatesItem()  # a fresh item for each candidate
    items['Candidate Name'] = names_temp[i]
    items['Image ID'] = images_temp[i]
    items['Work Authorization'] = wa_temp[i]
    request = scrapy.Request(temp_url, callback=self.parse_can_contents)
    request.cb_kwargs['items'] = items
    yield request
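
To see why a shared item produces repeated values, here is a minimal sketch in plain Python (no Scrapy needed). It is a simplified stand-in, not Scrapy's actual scheduler: it assumes all requests are queued before any callback runs, uses a dict in place of HbsCandidatesItem, and the function names and candidate names are made up for illustration.

```python
# Simplified stand-in for Scrapy's flow: requests are queued first and the
# callbacks read the attached item only afterwards. All names are hypothetical.

def make_requests_shared(names):
    """Buggy pattern: one item object reused for every request."""
    item = {}                          # created once, outside the loop
    queued = []
    for name in names:
        item['Candidate Name'] = name  # overwrites the same dict each time
        queued.append(item)            # every request holds the same object
    return queued


def make_requests_fresh(names):
    """Fixed pattern: a new item per request."""
    queued = []
    for name in names:
        item = {}                      # created inside the loop
        item['Candidate Name'] = name
        queued.append(item)
    return queued


def run_callbacks(queued):
    """Simulate the callbacks reading their item after queuing is done."""
    return [item['Candidate Name'] for item in queued]


names = ['Alice', 'Bob', 'Carol']      # made-up sample data
shared_result = run_callbacks(make_requests_shared(names))
fresh_result = run_callbacks(make_requests_fresh(names))
print(shared_result)  # ['Carol', 'Carol', 'Carol']
print(fresh_result)   # ['Alice', 'Bob', 'Carol']
```

The shared version repeats the last candidate because all three queued entries point at one object; creating the item inside the loop gives each request its own copy of the snippet fields.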