How to make Scrapy requests synchronous

Question

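I recently started using Scrapy and Python, so please bear with me. My code is based on this tutorial. I need to get some information for every city in my country (Brazil), across different years, from this website. The options in the drop-down lists are generated dynamically with AJAX requests. So I first get all the years and states, and then make a request to get the cities of each state.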

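I learned that if I use return inside the loop, as in the code below, it ends my function. The problem is that if I use yield, the requests do not follow any order (maybe because requests are asynchronous? Please let me know the reason as well), i.e. it requests the cities with the wrong state, so I get the wrong response. By the way, although return ends the function, it does make the correct request.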

    def parse(self, response):
        years = response.xpath(...).getall()
        states = response.xpath(...).getall()

        # Start from the second element since the first one is '-- Select --'
        for year in years[1:]:
            for state in states[1:]:
                print(year)
                print(state)
                # I need this request to get all cities from the current state,
                # since the list is generated with an AJAX request
                request = Request(
                    ...,
                    callback=self.parse_city,
                )
                return request

    def parse_city(self, response):
        keys = response.xpath(...).getall()
        values = response.xpath(...).getall()

        # Build a dictionary with the key (city IBGE code) and value (city name)
        cities = dict(zip(keys[1:], values[1:]))

        for code, city in cities.items():
            request = Request(
                ...,
                callback=self.parse_result,
            )
            return request

    def parse_result(self, response):
        yield {
           #The information that I want
        }

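My requests are created inside loops, and this is what I expected to happen. First: print the year and the state, then make the request. Second: the callback gets all the cities and makes a request for each city, for that state in that year. Third: parse_result gets the information I need. Instead, it prints all the years and states, which means it is not executing synchronously.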

How can I make it synchronous? How can I make sure my requests follow the correct order of my arrays?

Thanks a lot.

Edit, to make it clearer:

for each year   
    select year   
    for each state
        select state
        wait for cities options to load
        for each city
            get the information

Answer 1

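If I understand correctly, the problem here is that the server is stateful and keeps the current selection in session information. Is that right?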

One way to handle this is to use a separate session for each state and manage the sessions through cookiejars. For example:

for year in years[1:]:
    for state in states[1:]:
        yield Request(
            # ...,
            callback=self.parse_city, 
            meta={'cookiejar': state} 
        )

There is more information about cookiejars in the Scrapy documentation on the cookies middleware.
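The loop above uses yield rather than return, which is what lets it produce one request per (year, state) pair. The difference can be seen in plain Python, without Scrapy. This is only a minimal sketch: the function names and the sample values are made up for illustration.

```python
def first_only(years, states):
    # `return` ends the function on the very first iteration,
    # so only one (year, state) pair is ever produced.
    for year in years:
        for state in states:
            return (year, state)


def all_pairs(years, states):
    # `yield` turns the function into a generator: every pair is
    # produced, one at a time, as the caller iterates over it.
    for year in years:
        for state in states:
            yield (year, state)


years = ["2010", "2011"]  # placeholder values
states = ["SP", "RJ"]     # placeholder values

print(first_only(years, states))
print(list(all_pairs(years, states)))
```

Scrapy iterates over whatever a callback yields, so a callback written like all_pairs hands every request to the scheduler, while one written like first_only hands over a single request and stops.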

And yes, requests in Scrapy are scheduled and run asynchronously. That is why we have to give each request a callback function.
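To illustrate why the order of responses is not guaranteed, here is a small simulation that uses asyncio instead of Scrapy itself. It is a sketch under assumptions: fetch_cities is a hypothetical stand-in for the AJAX request, and the random delay stands in for network latency. The point it tries to show is that each response should carry its own year and state, much as a Scrapy Request can carry data to its callback through meta or cb_kwargs, instead of relying on the order in which responses arrive.

```python
import asyncio
import random


async def fetch_cities(year, state):
    # Hypothetical stand-in for the AJAX request: the delay is random,
    # so responses come back in an unpredictable order.
    await asyncio.sleep(random.uniform(0, 0.05))
    # The request context travels with the response.
    return {"year": year, "state": state}


async def crawl(years, states):
    tasks = [
        asyncio.create_task(fetch_cities(year, state))
        for year in years
        for state in states
    ]
    results = []
    # as_completed yields results in completion order,
    # not in the order the tasks were created.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


results = asyncio.run(crawl(["2010", "2011"], ["SP", "RJ"]))
for item in results:
    print(item["year"], item["state"])
```

The printed order may change from run to run, but every result still knows which year and state it belongs to, so nothing depends on the arrival order.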
