Empty list while scraping Google Search Result

I am trying to scrape Google search results, but the output I get is an empty list. Do you know what's going wrong here? I found a similar post on Stack Overflow where the solution said you should try passing a user_agent. I tried that, but it still returns nothing. If you have any ideas, please share.

import requests, webbrowser
from bs4 import BeautifulSoup

user_input = input("Enter something to search:")
print("googling.....")

google_search = requests.get("https://www.google.com/search?q="+user_input)
# print(google_search.text)

soup = BeautifulSoup(google_search.text , 'html.parser')
# print(soup.prettify())

search_results = soup.select('.r a')
# print(search_results)

for link in search_results[:5]:
    actual_link = link.get('href')
    print(actual_link)
    webbrowser.open('https://google.com/'+actual_link)

Google is blocking your request and throwing this error when it automatically detects requests coming from your computer network that appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.

This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help; a different computer using the same IP address may be responsible.

Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or if you are sending requests very quickly.
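Before parsing, it can help to detect this block page programmatically and back off instead of parsing an empty result. Below is a small heuristic sketch; the 429 status code and the marker strings are assumptions based on how Google's block page typically looks, not a documented contract:

```python
def looks_blocked(status_code, html):
    """Heuristic check for Google's 'unusual traffic' block page."""
    markers = ("unusual traffic", "not a robot", "/sorry/")
    html_lower = html.lower()
    return status_code == 429 or any(m in html_lower for m in markers)

# Usage with a requests response:
# resp = requests.get(url, headers=headers)
# if looks_blocked(resp.status_code, resp.text):
#     print("Blocked by Google; stop and retry later.")
```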

Try using selenium + python to get all the links.

Most websites nowadays use JavaScript to load their web pages dynamically, and Google is one of them. To load the full DOM (Document Object Model), you need a JavaScript engine, which beautifulsoup and requests don't have. Arun recommended selenium, and so do I, because it has an embedded JavaScript engine.

这是 Python Selenium 文档: https://selenium-python.readthedocs.io/
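A minimal sketch of the selenium approach follows. It assumes selenium 4; the driver setup (e.g. webdriver.Firefox()) and the div.g selector are assumptions on my part, since Google's markup changes frequently:

```python
def google_links(driver, query, limit=5):
    """Collect result links once the page's JavaScript has run.

    `driver` is a Selenium WebDriver (e.g. webdriver.Firefox());
    "css selector" is the string value of selenium's By.CSS_SELECTOR.
    """
    driver.get("https://www.google.com/search?hl=en&q=" + query)
    anchors = driver.find_elements("css selector", "div.g a")
    links = []
    for a in anchors[:limit]:
        href = a.get_attribute("href")
        if href and href.startswith("http"):
            links.append(href)
    return links

# Usage (requires selenium and a browser driver on PATH):
# from selenium import webdriver
# driver = webdriver.Firefox()
# try:
#     for link in google_links(driver, "tree"):
#         print(link)
# finally:
#     driver.quit()
```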

To get results from a Google page, you have to specify the User-Agent HTTP header. For English results, add the hl=en parameter to the search URL:

import requests
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

user_input = input("Enter something to search: ")
print("googling.....")

google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers)  # <-- add headers and hl=en parameter

soup = BeautifulSoup(google_search.text , 'html.parser')

search_results = soup.select('.r a')

for link in search_results:
    actual_link = link.get('href')
    print(actual_link)

Prints:

Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAVegQIAxAH
https://simple.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:tNzOpY417g8J:https://simple.wikipedia.org/wiki/Tree+&cd=23&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://simple.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAWegQIARAH
https://www.britannica.com/plant/tree
#
https://webcache.googleusercontent.com/search?q=cache:91hg5d2649QJ:https://www.britannica.com/plant/tree+&cd=24&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://www.britannica.com/plant/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAXegQIAhAJ
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
#
https://webcache.googleusercontent.com/search?q=cache:AVSszZLtPiQJ:https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree+&cd=25&hl=en&ct=clnk&gl=sk
https://teamtrees.org/
#
https://webcache.googleusercontent.com/search?q=cache:gVbpYoK7meUJ:https://teamtrees.org/+&cd=26&hl=en&ct=clnk&gl=sk
https://www.ldoceonline.com/dictionary/tree
#
https://webcache.googleusercontent.com/search?q=cache:oyS4e3WdMX8J:https://www.ldoceonline.com/dictionary/tree+&cd=27&hl=en&ct=clnk&gl=sk
https://en.wiktionary.org/wiki/tree
#
https://webcache.googleusercontent.com/search?q=cache:s_tZIjpvHZIJ:https://en.wiktionary.org/wiki/tree+&cd=28&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wiktionary.org/wiki/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAbegQICBAH
https://www.dictionary.com/browse/tree
#
https://webcache.googleusercontent.com/search?q=cache:EhFIP6m4MuIJ:https://www.dictionary.com/browse/tree+&cd=29&hl=en&ct=clnk&gl=sk
https://www.treepeople.org/tree-benefits
#
https://webcache.googleusercontent.com/search?q=cache:4wLYFp4zTuUJ:https://www.treepeople.org/tree-benefits+&cd=30&hl=en&ct=clnk&gl=sk

EDIT: To filter the results, you can use:

import requests
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

user_input = input("Enter something to search: ")
print("googling.....")

google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers)  # <-- add headers and hl=en parameter

soup = BeautifulSoup(google_search.text , 'html.parser')

search_results = soup.select('.r a')

for link in search_results:
    actual_link = link.get('href')
    if actual_link.startswith('#') or \
       actual_link.startswith('https://webcache.googleusercontent.com') or \
       actual_link.startswith('/search?'):
        continue
    print(actual_link)

Prints (for example):

Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
https://simple.wikipedia.org/wiki/Tree
https://www.britannica.com/plant/tree
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
https://teamtrees.org/
https://www.ldoceonline.com/dictionary/tree
https://en.wiktionary.org/wiki/tree
https://www.dictionary.com/browse/tree
https://www.treepeople.org/tree-benefits
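A side note on building the URL by string concatenation: spaces and special characters in the query should be percent-encoded first, for example with the standard library's quote_plus (this is an addition of mine, not part of the answer above):

```python
from urllib.parse import quote_plus

user_input = "binary tree traversal"
url = "https://www.google.com/search?hl=en&q=" + quote_plus(user_input)
print(url)  # https://www.google.com/search?hl=en&q=binary+tree+traversal
```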

The output the OP desires does not come from JavaScript, as was stated. All the data the OP needs is located in the HTML.

For the same reason, there is no point in using selenium either: everything needed is already present in the HTML returned by requests.


One of the problems, as others mentioned, occurs because no user-agent was specified, and you possibly passed the wrong user-agent, which leads to a completely different HTML that contains an error message or something similar. Check out what your user-agent is.

Pass a user-agent:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get(YOUR_URL, headers=headers)
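For reference, without the headers argument, requests identifies itself with its own default User-Agent, which is an easy tell for bot detection. You can inspect it like this (the exact version string depends on your installed requests):

```python
import requests

# The default User-Agent sent when no headers are passed.
print(requests.utils.default_user_agent())  # e.g. python-requests/2.31.0
```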

You can also grab attributes by passing them in square brackets:

element.get('href')
# is equivalent to
element['href']
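The equivalence can be seen on a tiny self-contained document (example.com is just a placeholder):

```python
from bs4 import BeautifulSoup

html = '<a class="demo" href="https://example.com">Example</a>'
a = BeautifulSoup(html, "html.parser").select_one("a.demo")

print(a.get("href"))     # https://example.com
print(a["href"])         # same value
print(a.get("missing"))  # None; .get() is safe for absent attributes
# a["missing"] would raise a KeyError instead
```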

Code and example in the online IDE (CSS selectors reference):

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah" # query
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with links and iterate over it
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.textualtees.com/products/fus-ro-dah-t-shirt
'''

Alternatively, you can achieve the same thing by using the Google Search Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't need to figure out why or how to deal with such problems, since that part (extraction/scraping) is already done for the end user. All that needs to be done is to iterate over the structured JSON and get what you want.

Code:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro day",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

---------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.etsy.com/market/fus_ro_dah
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.textualtees.com/products/fus-ro-dah-t-shirt
https://tenor.com/search/fus-ro-dah-gifs
'''

P.S. - I have a blog post that covers how to scrape Google Organic Search Results in more depth.

Disclaimer, I work for SerpApi.