How to pass information from one method to another in Scrapy

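I am scraping a website where I need to collect data from individual candidate profiles. The problem is that part of the data has to be extracted from the profile snippet on the search results page, and the rest has to be extracted after opening the profile itself.

The fields to be extracted from the snippet are: 1. Work Authorization 2. Candidate Name 3. Image ID

The remaining data can be extracted once the profile is opened.

Problem:

I wrote a spider and want to pass the data for the fields above from one method to another. Right now, when I run my spider, the data for these three fields is repeated for every candidate profile on a given page. I am new to both web scraping and Python. Could you help me?

I am attaching my spider code and items.py file for reference: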

import scrapy
from urllib.parse import urljoin
from hbs_candidates.items import HbsCandidatesItem

domain = 'https://www.myvisajobs.com'
url = 'https://www.myvisajobs.com/CV/Search.aspx?DG=Bachelor&P=1'
page_scraped = 2
classes = ['HighLight: ', 'Membership: ', 'Honor: ', 'Skills: ', 'Degree: ', 'Career Level: ', 'Certification: ','Occupation: ', 'Reference: ', 'Target Locations: ', 'Career Title: ', 'Goal: ', 'Target Title:']


class InfoSpider(scrapy.Spider):
    name = 'inform'
    start_urls = [url]
    # page_no = 1

    def parse(self, response):
        wa_temp = []
        items = HbsCandidatesItem()
        tables = response.xpath("""//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_divContent"]/center/table/tr""")
        names_temp = tables.css('b a::text').extract()
        images_temp = [domain + x for x in response.css('img::attr(src)').extract()[1:]]
        for i in range(len(tables)):
            wa = str(tables.xpath("""//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_divContent"]/center/table/tr[3]/td[2]/text()[6]""").extract()).split('Work Authorization: ')[1]
            if wa is not None:
                temp_wa = wa
            else:
                temp_wa = 'N/A'
            wa_temp.append(temp_wa)
        my_list = response.css('b a::attr(href)').extract()
        for i in range(len(my_list)):
            url_final = urljoin(url, my_list[i])
            temp_url = response.urljoin(url_final)
            items['Candidate Name'] = names_temp[i]
            items['Image ID'] = images_temp[i]
            items['Work Authorization'] = wa_temp[i]
            request = scrapy.Request(temp_url, callback=self.parse_can_contents)
            request.cb_kwargs['items'] = items
            yield request

    def parse_can_contents(self, response, items):
        # code that scrapes data from the profile page and assigns values to items
        # ...

        # I want to access the values passed from the parse method here
        yield items

items.py code:

from scrapy.item import Item, Field


class HbsCandidatesItem(Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value

I hope this is clear. Please ask if anything in the question is ambiguous. Thanks!

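The item should be created inside the for loop (items = HbsCandidatesItem()). Otherwise every request receives a reference to the same item object, so values assigned in later iterations overwrite what the earlier callbacks see.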

for i in range(len(my_list)):
    url_final = urljoin(url, my_list[i])
    temp_url = response.urljoin(url_final)
    # create a fresh item for each candidate so requests do not share state
    items = HbsCandidatesItem()
    items['Candidate Name'] = names_temp[i]
    items['Image ID'] = images_temp[i]
    items['Work Authorization'] = wa_temp[i]
    request = scrapy.Request(temp_url, callback=self.parse_can_contents)
    request.cb_kwargs['items'] = items
    yield request
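
To see why the placement of the item matters, here is a minimal sketch in plain Python, with no Scrapy involved. It assumes the callbacks only run after the loop has already handed out its requests, which is roughly what happens when downloads complete later. The `Item` class, the two `schedule_*` functions and the sample names are hypothetical stand-ins made up for this illustration.

```python
class Item(dict):
    """Hypothetical stand-in for a Scrapy item (a plain dict subclass)."""


def schedule_shared(names):
    """Pattern from the question: one item created before the loop and reused."""
    item = Item()
    pending = []
    for name in names:
        item['Candidate Name'] = name
        pending.append(item)  # every entry references the same object
    return pending


def schedule_fresh(names):
    """Pattern from the answer: a new item created on each iteration."""
    pending = []
    for name in names:
        item = Item()
        item['Candidate Name'] = name
        pending.append(item)  # each entry is an independent object
    return pending


# Made-up sample data; reading the items after the loop mimics late callbacks.
names = ['Alice', 'Bob', 'Carol']
shared = [item['Candidate Name'] for item in schedule_shared(names)]
fresh = [item['Candidate Name'] for item in schedule_fresh(names)]
print(shared)  # ['Carol', 'Carol', 'Carol']
print(fresh)   # ['Alice', 'Bob', 'Carol']
```

With the shared item, every pending entry reports the last value assigned, which matches the repeated fields described in the question. With a fresh item per iteration, each entry keeps its own values.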