Scrapy

Question

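I have been stuck on this for a few days. I am parsing a JSON object that contains some information. Inside this object there is a list of n contacts. Each contact has an ID that can be used to build a url, and the page at that url contains the contact's phone number.

So I want to start by creating an item and adding some information to it, then loop over the contacts. On each iteration I want to fetch the phone number from the url and add it to the original item.

My question: how do I return the scraped phone number and add it to the item? If I end the main parse method with "yield items", none of the data scraped in the loop is added to the item. However, if I end parseContact with "yield items" instead, the whole item is duplicated on every iteration.

Please help, I am about to break down :D

The code is below: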

    def parse(self, response):

        items = projectItem()
        rData = response.xpath('//*[@id="data"]/text()').get()
        dData = json.loads(rData)
        listOfContacts = dData["contacts"]
        Data = dData["customer"]

        items['customername'] = Data["companyName"]
        items['vatnumber'] = Data["vatNo"]
        items['contacts'] = []


        i=0
        for p in listOfContacts:
            id = json.dumps(p["key"])
            pid = id.replace("\"","")
            urlP = urljoin("https://example.com/?contactid=", pid)
            items['contacts'].append({"pid":pid,"name":p["name"]})

            yield scrapy.Request(urlP, callback=self.parseContact,dont_filter=True,cb_kwargs={'items':items},meta={"counter":i})
            i +=1
        # IF I YIELD HERE, NONE OF THE DATA IN THE LOOP GETS SAVED
        yield items




    def parseContact(self, response,items):
        i = response.meta['counter']

        data = response.xpath('//*[@id="contactData"]/script/text()').get()
        items['contacts'][i].update({"data":data})
        # IF I YIELD HERE THE ITEM IS DUPLICATED N TIMES
        yield items

Answer 1

If you want one item per company, you need to build the item completely before yielding it. I would do it like this:

import json
import scrapy

def parse(self, response):
    items = projectItem()
    rData = response.xpath('//*[@id="data"]/text()').get()
    dData = json.loads(rData)
    listOfContacts = dData["contacts"]
    Data = dData["customer"]
    items['contacts'] = []

    items['customername'] = Data["companyName"]
    items['vatnumber'] = Data["vatNo"]
    contacts_info = []
    # prepare list with the contact urls, pid & name
    for p in listOfContacts:
        key = json.dumps(p["key"])
        pid = key.replace("\"", "")
        # plain concatenation: urljoin would drop the "?contactid=" query and treat pid as a path
        urlP = "https://example.com/?contactid=" + pid
        contacts_info.append((urlP, pid, p["name"]))
        # store pid and name now, in the same order the requests will be made
        items['contacts'].append({"pid": pid, "name": p["name"]})
    if not contacts_info:
        # no contacts to fetch, so the item is already complete
        yield items
        return
    # get the first item from the list, and pass the rest of the list along in the meta
    urlP, pid, name = contacts_info.pop(0)
    yield scrapy.Request(urlP,
                         callback=self.parseContact,
                         dont_filter=True,
                         meta={"contacts_info": contacts_info,
                               "items": items})

def parseContact(self, response):
    contacts_info = response.meta['contacts_info']
    # index of the contact this response belongs to; defaults to 0 for the first request
    count = response.meta.get('count', 0)
    items = response.meta['items']
    data = response.xpath('//*[@id="contactData"]/script/text()').get()
    items['contacts'][count].update({"data": data})
    try:
        urlP, pid, name = contacts_info.pop(0)
    except IndexError:
        # list contacts info is empty, so the item is finished and can be yielded
        yield items
    else:
        # more contacts left: request the next one and pass the incremented index along
        yield scrapy.Request(urlP,
                             callback=self.parseContact,
                             dont_filter=True,
                             meta={"contacts_info": contacts_info,
                                   "items": items,
                                   "count": count + 1})

I'm not sure how pid relates to the counter, so check how pid and name are added to each contact entry in your own code, but I hope you get the idea.
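
To see the chaining idea in isolation, here is a minimal sketch in plain Python that does not use Scrapy at all. `FakeRequest`, `FakeResponse`, `crawl`, the company name, the contact names and the page bodies are all made up for illustration; only the pattern (one request at a time, with the item and the remaining contacts carried along in meta) mirrors the code above.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for a request: a url, a callback and a meta dict."""
    def __init__(self, url, callback, meta):
        self.url = url
        self.callback = callback
        self.meta = meta


class FakeResponse:
    """Hypothetical stand-in for a response: carries the request's meta and a body."""
    def __init__(self, request, body):
        self.url = request.url
        self.meta = request.meta
        self.body = body


def parse(contacts):
    """Build the item, then request only the first contact; the rest travel in meta."""
    item = {"customername": "ACME", "contacts": []}
    pending = [(f"https://example.com/?contactid={c['key']}", c["key"], c["name"])
               for c in contacts]
    if not pending:
        yield item  # nothing to fetch, the item is already complete
        return
    url, pid, name = pending.pop(0)
    yield FakeRequest(url, parse_contact,
                      {"item": item, "pending": pending, "pid": pid, "name": name})


def parse_contact(response):
    """Add one contact to the shared item, then either continue the chain or finish."""
    meta = response.meta
    item = meta["item"]
    item["contacts"].append({"pid": meta["pid"], "name": meta["name"],
                             "data": response.body})
    if meta["pending"]:
        url, pid, name = meta["pending"].pop(0)
        yield FakeRequest(url, parse_contact,
                          {"item": item, "pending": meta["pending"],
                           "pid": pid, "name": name})
    else:
        yield item  # last contact handled: emit the item exactly once


def crawl(contacts, pages):
    """Tiny scheduler: run callbacks until no requests remain and collect the items."""
    queue = deque(parse(contacts))
    items = []
    while queue:
        obj = queue.popleft()
        if isinstance(obj, FakeRequest):
            queue.extend(obj.callback(FakeResponse(obj, pages[obj.url])))
        else:
            items.append(obj)
    return items


contacts = [{"key": 1, "name": "Ann"}, {"key": 2, "name": "Bob"}]
pages = {
    "https://example.com/?contactid=1": "phone: 111",
    "https://example.com/?contactid=2": "phone: 222",
}
results = crawl(contacts, pages)
print(len(results))
print(results[0]["contacts"])
```

Here the item is emitted once, after the last contact has been added. The trade-off of chaining is that the contact pages are fetched one after another instead of concurrently.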

Scrapy - How do I return data to the main parse method from a yielded request?

python

yield