如何使用 scrapy 抓取动态搜索结果页面?
How do I scrape dynamic search results page with scrapy?
我正在尝试从网站 https://howlongtobeat.com/#search 抓取结果。但是,当我抓取时,20 个中只有前 6 个结果。
我的代码:
import scrapy
cards = response.css('div[class="search_list_details"]')
for card in cards:
game_name = card.css('a[class=text_white]::attr(title)').get()
print(game_name)
输出:
'Elden Ring'
'Cyberpunk 2077'
'Kirby and the Forgotten Land'
'LEGO Star Wars The Skywalker Saga'
'Tomb Raider'
'Hollow Knight'
'Eiyuden Chronicle Rising' #This is not displayed on the page
'This War of Mine' #This is also not displayed on the page
我尝试使用其他卡片选择器,例如 response.css('li[class=back_darkish]')
,但无济于事。
此外,我如何获取其他数据,例如要打败的小时数,以便我得到名称、完成类型和小时数的字典?:
<div>
<div class="search_list_tidbit text_white shadow_text">Main Story</div>
<div class="search_list_tidbit center time_100">50½ Hours </div>
<div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
<div class="search_list_tidbit center time_100">94 Hours </div>
<div class="search_list_tidbit text_white shadow_text">Completionist</div>
<div class="search_list_tidbit center time_100">127 Hours </div>
</div>
实际上,数据是从外部 url 生成的,即 API 调用 HTML 响应作为 POST
方法。
import scrapy
from scrapy.crawler import CrawlerProcess
class TestSpider(scrapy.Spider):
name = 'test'
def start_requests(self):
url = 'https://howlongtobeat.com/search_results?page=1'
payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
headers = {
"content-type":"application/x-www-form-urlencoded",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
}
yield scrapy.Request(url,method='POST', body=payload,headers=headers,callback=self.parse)
def parse(self, response):
cards = response.css('div[class="search_list_details"]')
for card in cards:
game_name = card.css('a[class=text_white]::attr(title)').get()
yield {
"game_name":game_name
}
if __name__ == "__main__":
process =CrawlerProcess()
process.crawl(TestSpider)
process.start()
输出:
{'game_name': 'Elden Ring'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Cyberpunk 2077'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Kirby and the Forgotten Land'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'LEGO Star Wars The Skywalker Saga'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hollow Knight'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Tomb Raider'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hades'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'The Witcher 3 Wild Hunt'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Red Dead Redemption 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Forbidden West'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Trek to Yomi'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Grand Theft Auto V'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'God of War'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Marvels Guardians of the Galaxy'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock Infinite'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Pokmon Legends Arceus'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Zero Dawn Complete Edition'}
2022-05-12 13:37:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-12 13:37:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 2754,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.49537,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 5, 12, 7, 37, 12, 172047),
'httpcompression/response_bytes': 23986,
'httpcompression/response_count': 1,
'item_scraped_count': 20,
我正在尝试从网站 https://howlongtobeat.com/#search 抓取结果。但是,当我抓取时,20 个中只有前 6 个结果。
我的代码:
import scrapy
cards = response.css('div[class="search_list_details"]')
for card in cards:
game_name = card.css('a[class=text_white]::attr(title)').get()
print(game_name)
输出:
'Elden Ring'
'Cyberpunk 2077'
'Kirby and the Forgotten Land'
'LEGO Star Wars The Skywalker Saga'
'Tomb Raider'
'Hollow Knight'
'Eiyuden Chronicle Rising' #This is not displayed on the page
'This War of Mine' #This is also not displayed on the page
我尝试使用其他卡片选择器,例如 response.css('li[class=back_darkish]')
,但无济于事。
此外,我如何获取其他数据,例如要打败的小时数,以便我得到名称、完成类型和小时数的字典?:
<div>
<div class="search_list_tidbit text_white shadow_text">Main Story</div>
<div class="search_list_tidbit center time_100">50½ Hours </div>
<div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
<div class="search_list_tidbit center time_100">94 Hours </div>
<div class="search_list_tidbit text_white shadow_text">Completionist</div>
<div class="search_list_tidbit center time_100">127 Hours </div>
</div>
实际上,数据是从外部 url 生成的,即 API 调用 HTML 响应作为 POST
方法。
import scrapy
from scrapy.crawler import CrawlerProcess
class TestSpider(scrapy.Spider):
name = 'test'
def start_requests(self):
url = 'https://howlongtobeat.com/search_results?page=1'
payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
headers = {
"content-type":"application/x-www-form-urlencoded",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
}
yield scrapy.Request(url,method='POST', body=payload,headers=headers,callback=self.parse)
def parse(self, response):
cards = response.css('div[class="search_list_details"]')
for card in cards:
game_name = card.css('a[class=text_white]::attr(title)').get()
yield {
"game_name":game_name
}
if __name__ == "__main__":
process =CrawlerProcess()
process.crawl(TestSpider)
process.start()
输出:
{'game_name': 'Elden Ring'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Cyberpunk 2077'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Kirby and the Forgotten Land'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'LEGO Star Wars The Skywalker Saga'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hollow Knight'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Tomb Raider'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hades'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'The Witcher 3 Wild Hunt'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Red Dead Redemption 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Forbidden West'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Trek to Yomi'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Grand Theft Auto V'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'God of War'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Marvels Guardians of the Galaxy'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock Infinite'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Pokmon Legends Arceus'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Zero Dawn Complete Edition'}
2022-05-12 13:37:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-12 13:37:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 2754,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.49537,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 5, 12, 7, 37, 12, 172047),
'httpcompression/response_bytes': 23986,
'httpcompression/response_count': 1,
'item_scraped_count': 20,