Unable to get entire HTML page with Python requests

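I'm working on a card editor for the game Cards Against Humanity. To get card ideas, I want to programmatically download an entire deck from the following web page. Using the browser's inspect tool, I found where the cards are stored: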

As can be seen, each card's id can be found inside the white-card and black-card classes, with the card's phrase or idea written inside it.

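The overall job of my code is to take a deck URL and return all of its cards (white and black). My first approach was to use the Requests package in Python. I used the following code: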

import requests
from bs4 import BeautifulSoup

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

root = soup.find(id='root')

However, when I inspect the root object I find that it is empty, although it should contain all of the white-card and black-card classes.

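It is common for a web page not to be fully populated on the initial page load. Typically, after the page loads, JavaScript code performs one or more AJAX requests that modify the DOM, which is why fetching the page with requests does not give you the final, completed DOM. So I loaded the page in my browser and looked at the XHR network requests made after the page load. However, none of them seemed to return the missing information, which was a bit puzzling. My solution was therefore to use Selenium to drive a browser (Chrome in the example below) and scrape the page. After the initial page load you need to wait a second or two to make sure the DOM is complete: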

from selenium import webdriver
from bs4 import BeautifulSoup
import time

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get(URL)
time.sleep(1) # wait a second for <div id="root"> to be fully loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
root = soup.find(id='root')
print(root)
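
The fixed time.sleep(1) above can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the content is there and give up after a timeout. Below is a minimal, library-agnostic sketch of that idea. `wait_until` and its parameters are names made up for illustration, and the `#root > *` selector in the usage comment is an assumption about this page's markup. Selenium also ships its own `WebDriverWait` helper for the same purpose.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]],
               timeout: float = 10.0,
               interval: float = 0.25) -> T:
    """Call probe() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    (Hypothetical helper, not part of Selenium or Requests.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Possible usage with the driver from the code above (assumed selector):
#   from selenium.webdriver.common.by import By
#   wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, "#root > *"))
#   soup = BeautifulSoup(driver.page_source, 'html.parser')
```

The probe returns an empty list while the root element has no children, so the loop keeps polling until the JavaScript has rendered something or the timeout is hit.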

Update

I took a closer look at the AJAX calls, and it looks like the following URL returns the actual data you are interested in:

https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get

import requests

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get'
resp = requests.get(URL)
print(resp.json())

Prints:

{'success': True, 'expansion': {'_id': '5e758e4034489b003f4529f6', 'name': 'Global Pandemic Pack', 'author': '5dfde1f4897a0f003e2fb547', 'description': "Who says in-house quarantine has to suck? For the price of a handful of toilet paper rolls, you can gain some original pandemic-themed cards that'll surely spice up your card games. Get your hands on the first-ever official Cards Lacking Originality card pack now! I mean it, right now!", 'price': 0, 'published': True, 'featured': True, 'dateCreated': '2020-03-21T03:47:12.167Z', '__v': 0, 'gamesUsed': 655, 'whiteCards': [',200 Trump bucks.', 'A free extra week on the cruise ship!', 'A long Zoom meeting with no obvious purpose.', 'A lukewarm bowl of bat soup.', 'A mass panic caused by a sneeze.', 'Babies concieved under quarantine.', 'Beautiful cross-cultural friendships.', 'Binging 30 straight seasons of "The Simpsons."', 'Burying your head in a screen to escape family time.', 'Costco: Battle Royale.', 'Craving any excuse to party.', 'Crying and then sleeping and then crying.', 'Eating all the quarantine food within a day.', 'Ejaculating into the air and trying to catch it in your mouth.', 'Exchanging blowjobs for Kleenex and toilet paper.', 'Forgetting what genuine human connection feels like.', 'Groupons at funeral homes.', 'Hating the media.', 'Insatiable horniness.', 'Kung Flu fighting.', "My Gram-Gram's loooooong vacation!", 'Online class shootings.', 'Only washing hands after the CDC says you have to.', 'Plague, Inc.', 'Praying for the sweet release of death.', 'Raging Ebola.', 'Rediscovering the wonders of video games.', 'Some Lyme disease to go with your Coronavirus.', 'The National Guard.', 'The other eighteen COVIDs.', 'Unnecessarily sensual Zoom messages.'], 'blackCards': ['America: #1 in _______!', "Doctor, I've been doing _______ lately and I fear that I may be very sick.", 'I cannot BELIEVE that the grocery store is sold out of _______ already!', 'We regret to inform you that _______ has officially been cancelled due to COVID-19.', 'What is the one good thing about this pandemic?', 'What was the most difficult thing to give up for social distancing?', "What's really to blame for the spread of the virus?", "What's the best way to kill time while trapped inside the house?", "_______ is the entire reason I'm still holding onto some sanity."]}}
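
Once you have the JSON, pulling the two card lists out is just dictionary access. Here is a small sketch, assuming the response keeps the shape printed above (`success`, `expansion`, `whiteCards`, `blackCards`). `extract_cards` is a hypothetical helper, and the `sample` payload is abbreviated from the output above so the snippet runs without a network call.

```python
from typing import Any


def extract_cards(payload: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (white_cards, black_cards) from a decoded /get response.

    Assumes the payload layout shown in the printed output; raises
    ValueError if the endpoint did not report success.
    """
    if not payload.get("success"):
        raise ValueError("endpoint did not report success")
    expansion = payload["expansion"]
    # Fall back to empty lists in case a deck has only one kind of card.
    white = list(expansion.get("whiteCards", []))
    black = list(expansion.get("blackCards", []))
    return white, black


# Abbreviated stand-in for resp.json() from the code above.
sample = {
    "success": True,
    "expansion": {
        "name": "Global Pandemic Pack",
        "whiteCards": ["A lukewarm bowl of bat soup.", "The National Guard."],
        "blackCards": ["America: #1 in _______!"],
    },
}

white, black = extract_cards(sample)
print(len(white), "white cards,", len(black), "black cards")
```

With the real response you would call extract_cards(resp.json()) instead of passing `sample`.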