Yielding values from consecutive parallel parse functions via meta in Scrapy

In my scrapy code I'm trying to scrape the following data from the parliament website that lists all members of parliament (MPs). Opening each MP's link, I make parallel requests to get the numbers I want to count. My intention is to yield each set of three numbers together with the MP's name and party.

Here are the numbers I want to scrape:

  1. How many bill proposals each MP has signed
  2. How many question proposals each MP has signed
  3. How many times each MP spoke in parliament

To count and yield the number of bills signed by each MP, I'm trying to write a 3-layer spider over the MPs:

What I want: I want to yield the three counts in the same row, together with the MP's name and party.

  • Problem 1) When I export the output to csv, it only creates fields for the speech count, name and party. It doesn't show fields for the bill proposals and question proposals.

  • Problem 2) There are two empty values for each MP, which I guess correspond to the missing values described in Problem 1 above.
  • Problem 3) What would be a better way to restructure my code so that the three values end up in the same row, instead of printing each MP three times, once for each value I'm scraping?

from scrapy import Spider
from scrapy.http import Request

import logging


class MvSpider(Spider):
    name = 'mv2'
    allowed_domains = ['tbmm.gov.tr']
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']

    def parse(self, response):
        mv_list = response.xpath("//ul[@class='list-group list-group-flush']") # taking all MPs listed

        for mv in mv_list:
            name = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
            party = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
            partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
            full_link = response.urljoin(partial_link)

            yield Request(full_link, callback = self.mv_analysis, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })


    def mv_analysis(self, response):
        name = response.meta.get('name')
        party = response.meta.get('party')

        billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        billprop_link = response.urljoin(billprop_link_path)

        questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
        questionprop_link = response.urljoin(questionprop_link_path)

        speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
        speech_link = response.urljoin(speech_link_path)

        yield Request(billprop_link, callback = self.bill_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of bill proposals to be requested

        yield Request(questionprop_link, callback = self.quest_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        }) #number of question proposals to be requested


        yield Request(speech_link, callback = self.speech_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of speeches to be requested




# COUNTING FUNCTIONS


    def bill_prop_counter(self,response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        billproposals = response.xpath("//tr[@valign='TOP']")

        yield  { 'bill_prop_count': len(billproposals),
                'name': name,
                'party': party}

    def quest_prop_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        researchproposals = response.xpath("//tr[@valign='TOP']")

        yield {'res_prop_count': len(researchproposals),
               'name': name,
               'party': party}

    def speech_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        speeches = response.xpath("//tr[@valign='TOP']")

        yield { 'speech_count' : len(speeches),
               'name': name,
               'party': party}

This happens because you are yielding dicts instead of Item objects, so the spider engine doesn't have the guideline of default fields that you want in the output.

In order for the csv output to include the fields bill_prop_count and res_prop_count, you should make the following changes in your code:

1 - Create a base item object with all the desired fields - you can create it in the items.py file of your scrapy project:

from scrapy import Item, Field


class MvItem(Item):
    name = Field()
    party = Field()
    bill_prop_count = Field()
    res_prop_count = Field()
    speech_count = Field()

2 - Import the created item into the spider code and yield items populated with the dicts, rather than single dicts:

from your_project.items import MvItem

...

# COUNTING FUNCTIONS
def bill_prop_counter(self,response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    billproposals = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{ 'bill_prop_count': len(billproposals),
            'name': name,
            'party': party})

def quest_prop_counter(self, response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    researchproposals = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{'res_prop_count': len(researchproposals),
           'name': name,
           'party': party})

def speech_counter(self, response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    speeches = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{ 'speech_count' : len(speeches),
           'name': name,
           'party': party})

The output csv will have all the possible columns of the item:

bill_prop_count,name,party,res_prop_count,speech_count
,Abdullah DOĞRU,AK Parti,,11
,Mehmet Şükrü ERDİNÇ,AK Parti,,3
,Muharrem VARLI,MHP,,13
,Muharrem VARLI,MHP,0,
,Jülide SARIEROĞLU,AK Parti,,3
,İbrahim Halil FIRAT,AK Parti,,7
20,Burhanettin BULUT,CHP,,
,Ünal DEMİRTAŞ,CHP,,22
...

Now, if you want all three counts in the same row, you'd have to change the design of your spider, possibly having only one counting function at a time while passing the item along in the meta attribute.