Wait for Scrapy callback function

I'm new to Scrapy and not very experienced with Python.

Here is my code:



import scrapy
import json

class MOOCSpider(scrapy.Spider):
    name = 'mooc'
    start_urls = ['https://www.plurk.com/search?q=italy']
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }
    global_id = 1458122036

    def parse(self, response):
        

        url = 'https://www.plurk.com/Search/search2'

        headers = {
            ...omitted...
        }


        for i in range(1, 10):
            formdata = {
                "after_id": str(self.global_id)
            }
            yield scrapy.FormRequest(url, callback=self.parse_api, formdata=formdata, headers=headers)


    def parse_api(self, response):
        raw = response.body
        data = json.loads(raw)
        posts = data["plurks"]
        users = data["users"]


        l = len(posts)
        i = 0
        for post in posts:
            i = i + 1
            if (i == l):
                # remember the id of the last post so the next request can continue after it
                self.global_id = post["plurk_id"]
            
            ...omitted code...
            
            yield {
                'Author': user_name,
                'Body': post['content'],
                'app': 'plurk'
            }



The problem I'm running into is that Scrapy first issues all of the requests in the for loop and only afterwards runs the code in parse_api. What I would like is for Scrapy to do one iteration of the for loop, call the callback function, wait for it to return, and only then do the next iteration.

This is because the ID needed for the next request is set in the global_id variable by the callback function.

You can't achieve this by scheduling requests in a loop.
You can only achieve it if each call to the parse/parse_api method schedules exactly one request (the next one):

import scrapy
import json

class MOOCSpider(scrapy.Spider):
    name = 'mooc'
    start_urls = ['https://www.plurk.com/search?q=italy']
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'DOWNLOAD_DELAY':5,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
    }

    def parse(self, response):
        # schedule only the first request (without a loop)
        formdata = {
            "query": 'italy',
            "start_date": "2019/12",
            "end_date": "2020/12",
            "after_id": '1458122036', #<- your initial global_id
        }
        yield scrapy.FormRequest('https://www.plurk.com/Search/search2', callback=self.parse_api, formdata=formdata)

    def parse_api(self, response):
        data = json.loads(response.body)
        after_id = None
        for post in data["plurks"]:
            after_id = post["plurk_id"]
            yield {
                'Author': data["users"][str(post["owner_id"])]["nick_name"],  #  instead of user_id?
                'Body': post["content"],
                'app': 'plurk'
            }
        # after end of this loop - after_id should contain required data for next request

        # instead of a separate loop variable, response.meta["depth"] is used to limit the number of requests
        if response.meta["depth"] <= 11 and after_id:  # schedule the next request
            formdata = {
                "query": 'italy',
                "start_date": "2019/12",
                "end_date": "2020/12",
                "after_id": str(after_id),
            }
            yield scrapy.FormRequest('https://www.plurk.com/Search/search2', callback=self.parse_api, formdata=formdata)
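
Scrapy's built-in DepthMiddleware sets response.meta["depth"], incrementing it by one for every request scheduled from a callback, which is why it can double as a request counter here. If you prefer not to check it by hand, a rough alternative (just a sketch, assuming the same spider and settings as above) is to let Scrapy's DEPTH_LIMIT setting cut the chain off for you:

    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'DOWNLOAD_DELAY': 5,
        # enforced by DepthMiddleware: requests nested deeper than this are
        # dropped, which ends the request chain (11 matches the manual check above)
        'DEPTH_LIMIT': 11,
    }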

Answering my own question:

Now the parse method makes only one request and calls the parse_api method once. parse_api processes the response and sets the global_id variable. Once it has processed its own response, it makes another request, passing itself as the callback function. By doing this you can guarantee that the global_id variable is set correctly, because a new request is only issued after parse_api has finished running.

request.cb_kwargs["loop_l"] is used to pass an extra argument to the callback function. In this case it is a counter that controls how many requests we make; we stop crawling once the counter reaches the limit (200 in the code below).

import scrapy
import json

plurk_id = []
class MOOCSpider(scrapy.Spider):
    name = 'mooc'
    start_urls = ['https://www.plurk.com/search?q=']
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }
    global_id = 1455890167

    url = 'https://www.plurk.com/Search/search2'

    headers = {
        ...OMITTED...
    }


    def parse(self, response):
        
        formdata = {
            "after_id": str(self.global_id)
        }
        request = scrapy.FormRequest(self.url, callback=self.parse_api, formdata=formdata, headers=self.headers)
        request.cb_kwargs["loop_l"] = str(0)
        yield request

    def parse_api(self, response, loop_l):
        int_loop_l = int(loop_l)
        int_loop_l = int_loop_l + 1
     
        # stop crawling once the counter reaches 200 requests
        if (int_loop_l == 200):
            return
        raw = response.body
        data = json.loads(raw)

       ...omitted code...
       ... GET AND SET THE NEW global_id FROM THE RESPONSE ...
        
        # make another request with the new id
        formdata = {
            "after_id": str(self.global_id)
        }
        request = scrapy.FormRequest(self.url, callback=self.parse_api, formdata=formdata, headers=self.headers)
        request.cb_kwargs["loop_l"] = str(int_loop_l)
        yield request
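
As a side note, cb_kwargs accepts arbitrary Python objects, so the counter does not strictly need the str()/int() round-trip used above. A minimal sketch (keeping the same names as the spider above) of passing it as an int:

        # pass the counter straight through cb_kwargs as an int instead of a str
        request = scrapy.FormRequest(self.url, callback=self.parse_api,
                                     formdata=formdata, headers=self.headers,
                                     cb_kwargs={"loop_l": int_loop_l})
        yield request

The callback signature stays parse_api(self, response, loop_l); loop_l then arrives as an int, so the int(loop_l) conversion at the top of the method can be dropped.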