No output while scraping Google search page

Question

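I am trying to scrape the blue highlighted part of a Google search result, as shown below:

When I use Inspect Element, it shows: span class="YhemCb". I have tried various soup.find and soup.find_all calls, but nothing I have tried so far produces any output. What command should I use to scrape this part?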

Answer 1

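Google uses JavaScript to render most of its web elements, so unfortunately requests and BeautifulSoup alone are not enough: the HTML they download does not yet contain the element you see in the browser's inspector.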
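One way to confirm this diagnosis is to check whether the class appears at all in the HTML you downloaded. The sketch below uses only the standard library's html.parser as a stand-in for BeautifulSoup, and the two HTML strings are made-up placeholders, not real Google markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class_in_html(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins for what a plain HTTP client and a real browser see.
static_html = "<div id='rso'><script>render()</script></div>"
rendered_html = "<div id='rso'><span class='YhemCb'>3-star hotel</span></div>"

print(count_class_in_html(static_html, "YhemCb"))
print(count_class_in_html(rendered_html, "YhemCb"))
```

If the count is zero for the text returned by requests, no soup.find or soup.find_all call can match, whatever arguments you pass.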

Instead, use Selenium! It essentially lets you control a browser from code.

First, you need to navigate to the Google page you want to scrape:

google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
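If the search terms come from user input rather than a hard-coded string, it is safer to let the standard library encode them. This is a small sketch; build_search_url is a helper name I made up for illustration.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL with the query safely encoded."""
    # urlencode turns spaces into '+' and escapes characters such as '&'
    return "https://www.google.com/search?" + urlencode({"q": query})


google_search = build_search_url("courtyard by marriott fayetteville fort bragg")
print(google_search)
```

The resulting string can be passed to driver.get exactly like the hard-coded URL above.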

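Then you have to wait until the review section has loaded in the browser.

This is done with WebDriverWait: you specify an element that must be present on the page before the script continues. The CSS selector [data-attrid="kc:/local:one line summary"] span lets me select the review information about the hotel.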

timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
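To make the waiting step less of a black box: by default WebDriverWait re-checks the condition every 0.5 seconds, ignores NoSuchElementException between checks, and raises TimeoutException once the timeout expires. The pure-Python sketch below imitates that polling loop. It is a simplified illustration, not Selenium's actual implementation, and wait_until and element_appears are hypothetical names.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical condition that only succeeds on the third check,
# standing in for an element that appears after JavaScript has run.
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "review element" if attempts["count"] >= 3 else None


print(wait_until(element_appears, timeout=2, poll_interval=0.01))
```

With the real WebDriverWait, wrapping the until call in try/except for selenium.common.exceptions.TimeoutException lets the script handle the case where the summary element never shows up.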

Finally, print the rating:

print(review_element.get_attribute('innerHTML'))
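Note that get_attribute('innerHTML') returns markup, so nested tags and entities come back as-is. Selenium's review_element.text property gives you the visible text instead. If you already have the innerHTML string, a rough way to reduce it to plain text with the standard library is sketched below; the sample string is made up and inner_html_to_text is a hypothetical helper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def inner_html_to_text(inner_html):
    """Strip tags from an innerHTML string and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(inner_html)
    parser.close()  # flush any text the parser is still buffering
    return " ".join("".join(parser.parts).split())


# Made-up fragment, not real Google markup
print(inner_html_to_text("<b>3-star</b>  hotel &amp; suites"))
```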

Here is the full code in case you want to try it:

import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
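# run Chrome without a visible window (the options.headless attribute was removed in newer Selenium releases)
options.add_argument('--headless')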
driver = webdriver.Chrome(options=options)

# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)

# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)

# print the rating
print(review_element.get_attribute('innerHTML'))

# close the browser so no Chrome process is left running
driver.quit()

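Note that Google is notoriously defensive against anyone trying to scrape it. You might succeed on your first few attempts, but eventually you will have to deal with Google CAPTCHAs.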
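A scraper can at least notice when this happens instead of timing out silently. The sketch below assumes that Google's CAPTCHA interstitial is served from a path beginning with /sorry. That is based on observed behavior, not a documented contract, so treat it as a heuristic that may break. looks_like_captcha_redirect is a hypothetical helper name.

```python
from urllib.parse import urlparse


def looks_like_captcha_redirect(url):
    """Heuristic: True if the URL path is /sorry or starts with /sorry/.

    Assumption: the block page lives under /sorry on the Google host.
    """
    path = urlparse(url).path
    return path == "/sorry" or path.startswith("/sorry/")


# Example URLs for illustration only
print(looks_like_captcha_redirect("https://www.google.com/sorry/index?continue=x"))
print(looks_like_captcha_redirect("https://www.google.com/search?q=hotel"))
```

In the Selenium script you could call it with driver.current_url right after driver.get(google_search) and stop or back off when it returns True.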

To get around that, I would suggest using a dedicated search engine scraping tool; a quickstart guide can help you get started!

Disclaimer: I work at Oxylabs.io

python

beautifulsoup