How do I obtain results from 'yield' in python?

Question

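Maybe yield in Python is remedial for some, but not for me... at least not yet. I understand that yield creates a 'generator'.

I stumbled upon yield when I decided to learn scrapy. I wrote some code for a Spider which works as follows:

1) Go to the start hyperlink and extract all hyperlinks. These are not full hyperlinks, just sub-directories appended to the start hyperlink.
2) Examine the hyperlinks and append those meeting specific criteria to the base hyperlink.
3) Use Request to navigate to the new hyperlink and parse it to find the uid in an element that has 'onclick'.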

import scrapy

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        # Collect every href on the start page (these are relative paths).
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                # Build an absolute URL and schedule it; parse_new gets that response.
                next_link = response.urljoin(link)
                yield scrapy.Request(next_link, callback=self.parse_new)

Edit 1:

                for uid_dict in self.parse_new(response):
                    print(uid_dict['uid'])
                    break

End Edit 1

Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
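A minimal sketch of why this happens, written without Scrapy: calling a generator method directly only hands it whatever object you pass in. FakeResponse and the way parse_new reports its URL are hypothetical stand-ins for illustration, not Scrapy classes or the spider's real logic.

```python
class FakeResponse:
    """Hypothetical stand-in for a response; it only remembers its URL."""
    def __init__(self, url):
        self.url = url


def parse_new(response):
    # A generator: it reports which page it was actually given.
    yield {'seen_url': response.url}


def parse(response):
    # Calling parse_new(response) directly reuses the response for the
    # start page. No request for the next link is ever made here.
    for item in parse_new(response):
        yield item


start = FakeResponse('https://www.alloweddomain.com')
results = list(parse(start))
print(results)
```

Only a Request handed back to the engine causes a new download; the callback then receives the response for that URL.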

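    def parse_new(self, response):
        # Each matching row is returned as a raw HTML string.
        trs = response.xpath("//*[@class='unit-directory-row']").getall()
        for tr in trs:
            if 'SpecificText' in tr:
                elements = tr.split()
                for element in elements:
                    if 'onclick' in element:
                        # The uid is the text between the parentheses of the onclick call.
                        subelement = element.split('(')[1]
                        uid = subelement.split(')')[0]
                        print(uid)
                        yield {
                            'uid': uid
                        }
                # Stop after the first row that contains 'SpecificText'.
                break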

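This works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.

What I don't understand is how I can 'use' the uid obtained by parse_new to create and navigate to a new hyperlink, as I would with a variable. I can't seem to return a variable with Request.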
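A common pattern for this is to yield the follow-up request from inside the callback that found the uid, rather than trying to hand the uid back to parse. The sketch below is a toy, pure-Python stand-in for that callback loop, not Scrapy's engine: ToyRequest, crawl, parse_unit, the PAGES table and the '/unit/<uid>' URL shape are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Hypothetical pages keyed by URL; a real crawler would download these.
PAGES = {
    '/start': 'onclick=show(42)',
    '/unit/42': 'details for unit 42',
}


@dataclass
class ToyRequest:
    """Toy stand-in for a request: a URL plus the callback to run on its page."""
    url: str
    callback: Callable


def parse_new(url, body):
    # Extract the uid, then use it to build the next request.
    uid = body.split('(')[1].split(')')[0]
    yield ToyRequest(f'/unit/{uid}', parse_unit)


def parse_unit(url, body):
    # Final callback: yield the scraped item.
    yield {'url': url, 'body': body}


def crawl(first_request):
    """Run callbacks; requests are queued, anything else is collected as an item."""
    queue, items = deque([first_request]), []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, PAGES[request.url]):
            if isinstance(result, ToyRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl(ToyRequest('/start', parse_new))
print(items)
```

In a real spider the equivalent step would be yielding scrapy.Request(new_url, callback=...) from parse_new, with new_url built from the uid.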

Answer 1

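I would take a look at What does the "yield" keyword do? to understand how yield works.

Meanwhile, spider.parse_new(response) is an iterable object. That is, you can obtain the results it yields with a for loop. For example,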

for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
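As a small illustration of the same point outside Scrapy, here is a sketch with a plain generator function; count_uids and its values are made up for the example. Nothing in the body runs until the generator is iterated, and next() or a comprehension work as well as a for loop.

```python
def count_uids():
    # Hypothetical generator yielding dicts shaped like the ones in parse_new.
    for uid in ('101', '102', '103'):
        yield {'uid': uid}


gen = count_uids()               # creates the generator; no body code has run yet
first = next(gen)                # runs the body up to the first yield
rest = [d['uid'] for d in gen]   # iterating consumes whatever is left

print(first['uid'], rest)

all_uids = [d['uid'] for d in count_uids()]  # a fresh call starts over
print(all_uids)
```

A generator can only be consumed once, which is why the second comprehension calls count_uids() again instead of reusing gen.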

Answer 2

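After a lot of reading and learning, I discovered that the reason scrapy does not execute the callback on the first parse has nothing to do with yield! It has a lot to do with two issues:

1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.

2) The logger shows "Filtered offsite request to". Passing dont_filter=True to the Request may resolve this.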
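For reference, the two changes could look roughly like this. ROBOTSTXT_OBEY is a Scrapy setting and dont_filter is an argument of scrapy.Request; where exactly they go in your project is an assumption here, and the request line is shown as a comment because it belongs inside the spider's parse method.

```python
# settings.py (sketch): stop Scrapy from honouring robots.txt for this project.
ROBOTSTXT_OBEY = False

# In the spider's parse method (sketch), let the request bypass the
# offsite and duplicate filters:
#     yield scrapy.Request(next_link, callback=self.parse_new, dont_filter=True)
```

Adding the target host to allowed_domains is another way to keep the offsite filter from dropping the request.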

