How to collect data into a single item from multiple URLs with Scrapy (Python)

Simply put, I want to collect the return values from the callback function until the for loop is exhausted, and then yield a single item.

What I want to do is the following:
I am creating new links that correspond to clicking the tabs on https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/, such as

  1. https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#ah;2

  2. https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#over-under;2 and so on. They are all essentially data for the same match, so I am trying to collect the betting information into a single item.

Basically, I use a for loop over a dict to create each new link and yield a request with a callback function.

import re
import time
import urllib.parse
from collections import OrderedDict

import fake_useragent
import requests
import scrapy


class CountryLinksSpider(scrapy.Spider):
    name = 'country_links'
    allowed_domains = ['oddsportal.com']
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.create_all_tabs_links_from_url)

    def create_all_tabs_links_from_url(self, response):
        current_url = response.request.url
        _other_useful_scrape_data_dict = OrderedDict(
            [('time', '19:00'), ('day', '14'), ('month', 'May'), ('year', '22'), ('Country', 'Africa'),
             ('League', 'CAF Champions'), ('Home', 'ES Setif'), ('Away', 'Al Ahly'), ('FT1', '2'), ('FT2', '2'),
             ('FT', 'FT'), ('1H H', '1'), ('1H A', '1'), ('1HHA', 'D'), ('2H H', '1'), ('2H A', '1'), ('2HHA', 'D')])

        with requests.Session() as s:
            s.headers = {
                "accept": "*/*",
                "accept-encoding": "gzip, deflate, br",
                "accept-language": "en-US,en;q=0.9,pl;q=0.8",
                "referer": 'https://www.oddsportal.com',
                "user-agent": fake_useragent.UserAgent().random
            }
            r = s.get(current_url)
            version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
            sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
            xeid = re.search(r'"id":"(.*?)"', r.text).group(1)

            xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))

        unix = int(round(time.time() * 1000))

        tabs_dict = {'#ah;2': ['5-2', 'AH full time', ['1', '2']], '#ah;3': ['5-3', 'AH 1st half', ['1', '2']],
                     '#ah;4': ['5-4', 'AH 2nd half', ['1', '2']], '#dnb;2': ['6-2', 'DNB full_time', ['1', '2']]}
        all_tabs_data = OrderedDict()
        all_tabs_data = all_tabs_data | _other_useful_scrape_data_dict

        for key, value in tabs_dict.items():
            api_url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}'

            # go to each main tab, get the data from it, and yield here
            single_tab_scrape_data = yield scrapy.http.Request(
                api_url, callback=self.scrape_single_tab)
        # what I want to do next: collect the data from all tabs into a single item
        # all_tabs_data = all_tabs_data | single_tab_scrape_data  # (as a dict)

        # yield all_tabs_data  # yield a single dict with the scraped data from all tabs

    def scrape_single_tab(self, response):
        # sample scraped data from the response
        scrape_dict = OrderedDict([('AH full time -0.25 closing 2', 1.59), ('AH full time -0.25 closing 1', 2.3),
                                   ('AH full time -0.25 opening 2', 1.69), ('AH full time -0.25 opening 1', 2.12),
                                   ('AH full time -0.50 opening 1', ''), ('AH full time -0.50 opening 2', '')])

        yield scrape_dict

What I have tried: first, I tried simply returning the scraped item from the scrape_match_data function, but I could not find a way to get the return value of the callback function from the yielded request.

I have also tried the following libraries: from inline_requests import inline_requests and from twisted.internet.defer import inlineCallbacks.

But I could not get them to work. I feel there must be a simpler way to append the scraped items from the different links into one item and yield it.
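
For reference, the documented usage pattern of the scrapy-inline-requests package is a decorator that lets a callback receive responses through plain yield assignments. The sketch below follows the library's README; the tab URLs and the extraction line are illustrative, not taken from the question:

import scrapy
from inline_requests import inline_requests


class TabsInlineSpider(scrapy.Spider):
    name = 'tabs_inline'
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/']

    @inline_requests
    def parse(self, response):
        item = {'match_url': response.url}
        # illustrative tab URLs; the real spider would build the feed URLs here
        tab_urls = [response.urljoin('?r=2'), response.urljoin('?r=3')]
        for url in tab_urls:
            # with @inline_requests, yield hands the response object back
            tab_response = yield scrapy.Request(url)
            item[url] = len(tab_response.text)  # placeholder for real extraction
        yield item  # one item built from all the tab responses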

Please help me solve this.

Technically, in Scrapy we have two ways of transferring data between the callback functions we use to construct an item from multiple requests:

1. The Request meta dict:

def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        meta={'scraped_item_data': data})

def parse_details(self, response):
    scraped_data = response.meta.get('scraped_item_data') # <- not present in Your code
    ...

You have probably missed the response.meta.get('_scrape_dict') call to access the data scraped in the previous callback function.
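
Applied to the spider from the question, the for loop could attach the shared match data to each tab request through meta. A rough sketch reusing the names from the question (untested against the real site):

        for key, value in tabs_dict.items():
            api_url = (f'https://fb.oddsportal.com/feed/match/'
                       f'{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}')
            yield scrapy.Request(api_url,
                                 callback=self.scrape_single_tab,
                                 meta={'all_tabs_data': all_tabs_data})

    def scrape_single_tab(self, response):
        all_tabs_data = response.meta.get('all_tabs_data')  # data passed from the previous callback
        ...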

2. cb_kwargs, available in Scrapy 1.7 and newer:

def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        cb_kwargs={'scraped_item_data': data})

def parse_details(self, response, scraped_item_data):  # <- already accessible data from previous request
    ...
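
A useful property of cb_kwargs is that a partially built item can be handed down a chain of requests and yielded only by the last callback. A minimal sketch with illustrative URLs and selectors (nothing here comes from the question):

import scrapy


class ChainSpider(scrapy.Spider):
    name = 'chain'
    start_urls = ['https://example.com/match']  # illustrative

    def parse(self, response):
        item = {'title': response.css('h1::text').get()}  # first part of the item
        yield scrapy.Request(response.urljoin('details'),
                             callback=self.parse_details,
                             cb_kwargs={'item': item})

    def parse_details(self, response, item):
        item['details'] = response.css('#details::text').get()  # second part
        yield item  # one item assembled across two requests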

3. A single item from multiple responses of the same type.
The simplest way to implement this is to assign the data to a class variable. The code would look like this:

def parse(self, response):
    self.tabs_data = []
    ...
    self.tabs_number = len(tabs)  # number of tabs; use len(list(tabs)) if tabs is a generator
    for tab in tabs:
        yield Request(...

def parse_details(self, response):
    scraped_tab_data = ...
    self.tabs_data.append(scraped_tab_data)
    if len(self.tabs_data) == self.tabs_number: # when data from all tabs collected
        # compose one big item
        ...
        yield item
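
Putting approach 3 together with the spider from the question, a minimal end-to-end sketch could look like this. Note that build_api_url and extract_odds are hypothetical helpers standing in for the versionId/sportId/xeid/xhash logic and the per-tab parsing from the question, and the class attributes assume the spider handles one match at a time:

from collections import OrderedDict

import scrapy


class AllTabsSpider(scrapy.Spider):
    name = 'all_tabs'
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    def parse(self, response):
        # same tab mapping as in the question, trimmed to two entries
        tabs_dict = {'#ah;2': ['5-2', 'AH full time'], '#dnb;2': ['6-2', 'DNB full_time']}
        self.all_tabs_data = OrderedDict()  # accumulator shared by all tab callbacks
        self.tabs_total = len(tabs_dict)    # number of tab responses to wait for
        self.tabs_done = 0
        for suffix, (feed_id, label) in tabs_dict.items():
            # build_api_url is a hypothetical helper wrapping the
            # versionId/sportId/xeid/xhash extraction from the question
            yield scrapy.Request(self.build_api_url(response, feed_id),
                                 callback=self.parse_tab, cb_kwargs={'label': label})

    def parse_tab(self, response, label):
        # extract_odds is a hypothetical helper returning one tab's odds as a dict
        self.all_tabs_data.update(self.extract_odds(response, label))
        self.tabs_done += 1
        if self.tabs_done == self.tabs_total:
            # the last tab response has arrived: yield one combined item
            yield self.all_tabs_data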