Using python requests with google search

I am new to Python. In PyCharm I wrote this code:

import requests
from bs4 import BeautifulSoup

response = requests.get(f"https://www.google.com/search?q=fitness+wear")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

Instead of the HTML of the search results, I get the HTML of a different page: a pop-up asking me to consent.

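I used the same code in a script on pythonanywhere.com and it works fine there. I have tried many of the solutions I found, but the result is always the same, so now I am stuck.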

I think this should work:

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    url = "https://www.google.com/search?q=fitness+wear"
    headers = {
        # The header value is only the URL, without a "referer: " prefix.
        "referer": "https://www.google.com/",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    # Initial POST, intended to pick up any cookies before the actual search.
    s.post(url, headers=headers)
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

It uses a requests session and a POST request to create any initial cookies (I am not entirely sure about this part), which then lets you scrape.

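If you open a private window in your browser and go to google.com, you should see the same pop-up asking you to consent. This is because you are not sending a session cookie.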
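To make the role of the cookie more concrete, here is a minimal sketch of how a requests.Session attaches cookies from its jar to an outgoing request. The cookie value is a made-up placeholder, and the request is only prepared, never sent, so this shows the mechanism rather than what Google actually expects.

```python
import requests

# Hypothetical cookie value, used only for illustration.
CONSENT_VALUE = "YES+example"

with requests.Session() as session:
    # Store a cookie in the session's jar, scoped to google.com.
    session.cookies.set("CONSENT", CONSENT_VALUE, domain=".google.com", path="/")

    # Build the request without sending it, so no network access is needed.
    request = requests.Request(
        "GET", "https://www.google.com/search", params={"q": "fitness wear"}
    )
    prepared = session.prepare_request(request)

# The prepared request carries the encoded query and the cookie from the jar.
cookie_header = prepared.headers.get("Cookie", "")
print(prepared.url)
print(cookie_header)
```

A request made without the session (or with an empty jar) would have no Cookie header at all, which matches the private-window situation described above.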

You have different options to solve this. One is to send the cookies you can observe on the website directly, like this:

import requests
cookies = {"CONSENT":"YES+shp.gws-20210330-0-RC1.de+FX+412", ...}

resp = requests.get("https://www.google.com/search?q=fitness+wear", cookies=cookies)

The solution used by @Dimitriy Kruglikov is cleaner; using a session is a good way to keep a persistent session with a website.

Google won't block you; you can still extract data from the HTML.

Using cookies is not very convenient, and using a session with both a POST and a GET request generates more traffic.

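You can remove this pop-up with the decompose() or extract() BS4 methods:

  • annoying_popup.decompose() will completely destroy the tag and its contents. Documentation.

  • annoying_popup.extract() will leave you with two HTML trees: one rooted at the BeautifulSoup object you used to parse the document, and another rooted at the extracted tag. Documentation.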
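The difference between the two methods can be sketched on a small document. The markup and the "consent-popup" id below are invented for illustration; the real pop-up will need its own selector.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a results page with a consent pop-up.
html = """
<html><body>
  <div id="consent-popup"><p>Before you continue</p></div>
  <div id="results"><a href="https://example.com">Example result</a></div>
</body></html>
"""

# decompose(): the tag and its contents are destroyed in place.
soup = BeautifulSoup(html, "html.parser")
annoying_popup = soup.find("div", id="consent-popup")
annoying_popup.decompose()
popup_left_after_decompose = soup.find("div", id="consent-popup")

# extract(): the tag is removed from the tree but returned,
# so it can still be inspected as a separate tree.
soup = BeautifulSoup(html, "html.parser")
annoying_popup = soup.find("div", id="consent-popup")
extracted = annoying_popup.extract()
popup_left_after_extract = soup.find("div", id="consent-popup")
extracted_text = extracted.get_text(strip=True)

# The rest of the document is still available for scraping.
first_link = soup.find("div", id="results").a["href"]
print(popup_left_after_decompose, popup_left_after_extract)
print(extracted_text, first_link)
```

If the pop-up's contents are never needed again, decompose() is the simpler choice; extract() is useful when you want to look at what was removed.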

After that you can scrape everything you need. You can also scrape without removing the pop-up.

See this Organic Results extraction I did recently. It scrapes the title, snippet and link from Google search results.


Alternatively, you can use the Google Search Engine Results API from SerpApi. Check out the Playground.

Code and example in the online IDE:

from serpapi import GoogleSearch
import os

params = {
  "engine": "google",
  "q": "fus ro dah",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  print(f"Title: {result['title']}\nSnippet: {result['snippet']}\nLink: {result['link']}\n")

Output:

Title: Skyrim - FUS RO DAH (Dovahkiin) HD - YouTube
Snippet: I looked around for a fan made track that included Fus Ro Dah, but the ones that I found were pretty bad - some ...
Link: https://www.youtube.com/watch?v=JblD-FN3tgs

Title: Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Snippet: If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: "Fus Rah Do" instead of the proper "Fus Ro Dah." ...
Link: https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)

Title: Fus Ro Dah | Know Your Meme
Snippet: Origin. "Fus Ro Dah" are the words for the "unrelenting force" thu'um shout in the game Elder Scrolls V: Skyrim. After reaching the first town of ...
Link: https://knowyourmeme.com/memes/fus-ro-dah

Title: Fus ro dah - Urban Dictionary
Snippet: 1. A dragon shout used in The Elder Scrolls V: Skyrim. 2.An international term for oral sex given by a female. ex.1. The Dragonborn yelled "Fus ...
Link: https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah

Part of the JSON:

"organic_results": [
  {
    "position": 1,
    "title": "Unrelenting Force (Skyrim) | Elder Scrolls | Fandom",
    "link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)",
    "displayed_link": "https://elderscrolls.fandom.com › wiki › Unrelenting_F...",
    "snippet": "If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: \"Fus Rah Do\" instead of the proper \"Fus Ro Dah.\" ...",
    "sitelinks": {
      "inline": [
        {
          "title": "Location",
          "link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Location"
        },
        {
          "title": "Effect",
          "link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Effect"
        },
        {
          "title": "Usage",
          "link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Usage"
        },
        {
          "title": "Word Wall",
          "link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Word_Wall"
        }
      ]
    },
    "cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:K3LEBjvPps0J:https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)+&cd=17&hl=en&ct=clnk&gl=us"
  }
]
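When working with a dictionary shaped like the JSON above, it can help to read the fields defensively. The sketch below uses a hypothetical helper, summarize_organic_results, and assumes that a result might lack a snippet; that is an assumption for illustration, not something the output above shows. The sample data is trimmed from the JSON above, plus one made-up entry.

```python
def summarize_organic_results(results):
    """Return (title, snippet, link) tuples, using "" for missing fields."""
    rows = []
    for result in results.get("organic_results", []):
        rows.append((
            result.get("title", ""),
            result.get("snippet", ""),
            result.get("link", ""),
        ))
    return rows


# Trimmed-down sample shaped like the JSON above; the second entry is
# made up to show a result without a snippet.
sample = {
    "organic_results": [
        {
            "position": 1,
            "title": "Unrelenting Force (Skyrim) | Elder Scrolls | Fandom",
            "link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)",
            "snippet": "If the general subtitles are turned on, ...",
        },
        {"position": 2, "title": "Hypothetical result", "link": "https://example.com"},
    ]
}

rows = summarize_organic_results(sample)
for title, snippet, link in rows:
    print(f"Title: {title}\nSnippet: {snippet}\nLink: {link}\n")
```

Using .get() with a default avoids a KeyError when a field or the whole "organic_results" list is absent, at the cost of silently printing empty values.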

Disclaimer, I work for SerpApi.