How do I scrape a dynamic search results page with scrapy?

Question

I am trying to scrape results from the website https://howlongtobeat.com/#search. However, when I scrape it, I only get the first 6 of the 20 results.

My code:

import scrapy



cards = response.css('div[class="search_list_details"]')

for card in cards:
    game_name = card.css('a[class=text_white]::attr(title)').get()
    print(game_name)

Output:

'Elden Ring'
'Cyberpunk 2077'
'Kirby and the Forgotten Land'
'LEGO Star Wars The Skywalker Saga'
'Tomb Raider'
'Hollow Knight'
'Eiyuden Chronicle Rising' #This is not displayed on the page
'This War of Mine' #This is also not displayed on the page

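I tried other card selectors, such as response.css('li[class=back_darkish]'), but to no avail.

Also, how do I get the other data, such as the hours to beat, so that I end up with a dictionary of name, completion type, and hours? The markup looks like this: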

<div>
    <div class="search_list_tidbit text_white shadow_text">Main Story</div>
    <div class="search_list_tidbit center time_100">50½ Hours </div>
    <div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
    <div class="search_list_tidbit center time_100">94 Hours </div>
    <div class="search_list_tidbit text_white shadow_text">Completionist</div>
    <div class="search_list_tidbit center time_100">127 Hours </div>
</div>

Answer 1

Actually, the results are loaded from a separate URL: the page sends a POST request to an API endpoint, which responds with HTML.

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # The search results come from this endpoint via POST,
        # not from the initial page.
        url = 'https://howlongtobeat.com/search_results?page=1'
        # Form-encoded request body.
        payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
        headers = {
            "content-type": "application/x-www-form-urlencoded",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }

        yield scrapy.Request(url, method='POST', body=payload, headers=headers, callback=self.parse)

    def parse(self, response):
        # The response is HTML; each result sits in a search_list_details div.
        cards = response.css('div[class="search_list_details"]')

        for card in cards:
            game_name = card.css('a[class=text_white]::attr(title)').get()
            yield {
                "game_name": game_name
            }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

{'game_name': 'Elden Ring'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Cyberpunk 2077'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Kirby and the Forgotten Land'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'LEGO Star Wars The Skywalker Saga'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hollow Knight'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Tomb Raider'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hades'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'The Witcher 3 Wild Hunt'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Red Dead Redemption 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Forbidden West'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Trek to Yomi'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Grand Theft Auto V'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'God of War'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Marvels Guardians of the Galaxy'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock Infinite'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Pokmon Legends Arceus'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Zero Dawn  Complete Edition'}
2022-05-12 13:37:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-12 13:37:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 2754,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.49537,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 5, 12, 7, 37, 12, 172047),
 'httpcompression/response_bytes': 23986,
 'httpcompression/response_count': 1,
 'item_scraped_count': 20,
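
The form body in the spider is a hand-written string. As a minimal sketch, the same body can be built from a dict with the standard library, which makes individual fields easier to change. The field names and values are copied from the payload above. `FORM_FIELDS` and `build_payload` are hypothetical names, and the idea that `queryString` carries the search term is an assumption based only on the field name.

```python
from urllib.parse import urlencode

# Field names and values copied from the payload string in the spider.
FORM_FIELDS = {
    "queryString": "",
    "t": "games",
    "sorthead": "popular",
    "sortd": "0",
    "plat": "",
    "length_type": "main",
    "length_min": "",
    "length_max": "",
    "v": "",
    "f": "",
    "g": "",
    "detail": "",
    "randomize": "0",
}


def build_payload(query=""):
    """Return the form-encoded body (hypothetical helper).

    Assumption: `queryString` is the search term; this is not confirmed
    by the site, only suggested by the field name.
    """
    fields = dict(FORM_FIELDS, queryString=query)
    return urlencode(fields)


print(build_payload())
```

With an empty query this reproduces the payload string used in the spider. Within Scrapy, `scrapy.FormRequest(url, formdata=FORM_FIELDS, callback=self.parse)` does the same encoding and defaults to POST when `formdata` is given.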

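The spider yields only the game name. For the second part of the question, here is a minimal sketch of one way to pair each completion type with its hours, assuming every card contains the alternating label/value `search_list_tidbit` divs shown in the question. `parse_hours` and `pair_tidbits` are hypothetical helper names, not part of Scrapy, and the handling of "½" is based only on the sample strings in the question.

```python
import re


def parse_hours(text):
    """Convert a string such as '50½ Hours' to a float.

    Returns None when the text is not an hours value (assumption: other
    formats may occur on the site and are left unparsed here).
    """
    match = re.match(r"\s*(\d+)?(½)?\s*Hours?", text)
    if not match or not (match.group(1) or match.group(2)):
        return None
    return int(match.group(1) or 0) + (0.5 if match.group(2) else 0.0)


def pair_tidbits(texts):
    """Pair alternating label/value strings into a dict of label -> hours."""
    cleaned = [t.strip() for t in texts if t.strip()]
    # Even positions hold labels, odd positions hold the matching durations.
    return {
        label: parse_hours(value)
        for label, value in zip(cleaned[0::2], cleaned[1::2])
    }


# Text nodes as they appear in the HTML fragment from the question.
sample = [
    "Main Story", "50½ Hours ",
    "Main + Extra", "94 Hours ",
    "Completionist", "127 Hours ",
]
print(pair_tidbits(sample))
# {'Main Story': 50.5, 'Main + Extra': 94.0, 'Completionist': 127.0}
```

Inside `parse`, the list could come from `card.css('div.search_list_tidbit::text').getall()`, so each yielded item might look like `{"game_name": game_name, "times": pair_tidbits(...)}`.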